ECML/PKDD 2004, Pisa, Italy, September 20-24, 2004

Invited Talks

Algorithms for data mining: theory meets practice (in a chat room)


In this talk I will survey some of the work going on in Microsoft Research in the areas of machine learning and data mining. In particular, I will talk about privacy in public databases, statistical methods for data cleaning, and algorithms for large-scale spectral computations. I will also talk about our experiences with an artifact that raises concrete, fascinating questions in each of these three areas: the MSN Messenger "buddies network", consisting of approximately 100 million users and 2 billion edges.

Speaker: Dimitris Achlioptas (Microsoft Research, Redmond)

Dimitris Achlioptas received his B.Eng. in Computer Engineering from the University of Patras in 1993 and his M.Sc. and Ph.D. in Computer Science from the University of Toronto in 1995 and 1999. He subsequently joined Microsoft Research as a postdoctoral fellow, where he has been a research staff member since 2000. His research interests are centered around the interaction of random structures with computation. He has published in AAAI, FOCS, IJCAI, NIPS, PODS, SODA, STOC and other conferences, as well as JACM, JAMS, JCSS, SICOMP and other journals. He has served as program committee member for AAAI, FOCS, ICML, ICDM, RANDOM, SAT, SODA, WAW and other conferences. His recent work has included the analysis of large random networks and the use of randomization to accelerate algorithms in machine learning, information retrieval, and constraint satisfaction.

Home page:

Privacy and Data Mining


There is an increasing need to build information systems that protect the privacy and ownership of data without impeding the flow of information. We will present some of our current work to demonstrate the technical feasibility of building such systems:
Privacy-preserving data mining. The conventional wisdom held that data mining and privacy were adversaries, and that the use of data mining must be restricted to protect privacy. Privacy-preserving data mining cleverly finesses this conflict by exploiting the difference between the level where we care about privacy, i.e., individual data, and the level where we run data mining algorithms, i.e., aggregated data. User data is randomized such that it is impossible to recover anything meaningful at the individual level, while still allowing the data mining algorithms to recover aggregate information, build mining models, and provide actionable insights.
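The randomization idea above can be illustrated with a minimal sketch. This is not the specific reconstruction algorithm from the speaker's work; it only shows the core intuition that zero-mean noise hides individual values but cancels out in large aggregates. The data, noise range, and function names are illustrative assumptions.

```python
import random

def randomize(values, spread=50.0, seed=0):
    """Perturb each value with uniform noise in [-spread, +spread].

    Any single reported value reveals little about the true one, but
    because the noise has zero mean, it cancels out in large aggregates.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-spread, spread) for v in values]

# Hypothetical data: ages of 100,000 users (true mean = 40).
true_ages = [40.0] * 100_000
noisy_ages = randomize(true_ages)

true_mean = sum(true_ages) / len(true_ages)
noisy_mean = sum(noisy_ages) / len(noisy_ages)

# Individually, a reported age may be off by up to 50 years,
# yet the aggregate mean is recovered almost exactly.
assert abs(noisy_mean - true_mean) < 1.0
```

In the published randomization approach, more sophisticated reconstruction recovers the full distribution of the original values, not just the mean, which is what lets decision-tree and association-rule algorithms run on the perturbed data.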
Hippocratic databases. Unlike current systems, Hippocratic databases include responsibility for the privacy of the data they manage as a founding tenet. Their core capabilities have been distilled from the principles behind current privacy legislation and guidelines. We identify the technical challenges and problems in designing Hippocratic databases, and also outline some solutions.
Sovereign information sharing. Current information integration approaches are based on the assumption that the data in each database can be revealed completely to the other databases. Trends such as end-to-end integration, outsourcing, and security are creating the need for integrating information across autonomous entities. In such cases, the enterprises do not wish to completely reveal their data. In fact, they would like to reveal minimal information apart from the answer to the query. We have formalized the problem, identified key operations, and designed algorithms for these operations, thereby enabling a new class of applications, including information exchange between security agencies, intellectual property licensing, crime prevention, and medical research.
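The "reveal only the answer to the query" goal above can be sketched for the simplest key operation, an intersection join. The following is a deliberately naive illustration using salted hashes: real sovereign-sharing protocols rely on stronger primitives such as commutative encryption, and a hash scheme like this is vulnerable to dictionary attacks on low-entropy keys. The parties, salt, and data are hypothetical.

```python
import hashlib

def blind(keys, shared_salt):
    """Map each key to a salted SHA-256 digest, keeping a digest->key index."""
    return {hashlib.sha256((shared_salt + k).encode()).hexdigest(): k
            for k in keys}

# Two hypothetical agencies, each holding a list it will not reveal in full.
salt = "pre-agreed-secret"
a = blind({"alice", "bob", "carol"}, salt)
b = blind({"bob", "dave"}, salt)

# Each side exchanges only digests; matching digests expose exactly
# the intersection and nothing about the non-matching entries.
shared_digests = a.keys() & b.keys()
intersection = {a[d] for d in shared_digests}
assert intersection == {"bob"}
```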

Speaker: Rakesh Agrawal (IBM Almaden Research Center)

Rakesh Agrawal is an IBM Fellow, whose current research interests include privacy technologies for data systems, web technologies, data mining and OLAP. He leads the Intelligent Information Systems Research group (a.k.a. the Quest project) at the IBM Almaden Research Center, which pioneered key data mining concepts and technologies. He has published more than 100 research papers and has been granted 50 patents. He is the recipient of the ACM-SIGKDD First Innovation Award, the ACM-SIGMOD 2000 Innovations Award, as well as the ACM-SIGMOD 2003 Test of Time Award. He was recently selected as one of the 2003 Scientific American 50, which recognized singular accomplishments of those who have contributed to the advancement of technology in the realms of science, engineering, commerce and public policy. He was singled out for devising methods to preserve the privacy of information in large databases. He is also a Fellow of IEEE and a Fellow of ACM.

Rakesh Agrawal received the M.S. and Ph.D. degrees in Computer Science from the University of Wisconsin-Madison in 1983. He also has a B.E. degree in Electronics and Communication Engineering from the University of Roorkee, and a two-year Post Graduate Diploma in Industrial Engineering from the National Institute of Industrial Engineering (NITIE), Bombay. Prior to joining IBM Almaden in 1990, he was with Bell Laboratories, Murray Hill, from 1983 to 1989.

Home page:

Breaking through the syntax barrier: Searching with entities and relations


The next wave in search technology will be driven by the identification, extraction, and exploitation of real-world entities represented in unstructured textual sources. Search systems will either let users express information needs naturally and analyze them more intelligently, or allow simple enhancements that give users more control over the search process. The data model will exploit graph structure where available, but not impose structure by fiat. First-generation Web search, which uses graph information at the macroscopic level of inter-page hyperlinks, will be enhanced to use fine-grained graph models involving page regions, tables, sentences, phrases, and real-world entities. New algorithms will combine probabilistic evidence from diverse features to produce responses that are not URLs or pages, but entities and their relationships, or explanations of how multiple entities are related.

Speaker: Soumen Chakrabarti (Indian Institute of Technology, Bombay)

Soumen Chakrabarti received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors.

He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the Clever Web search project and led the Focused Crawling project.

In 1999 he moved as an Assistant Professor to the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay, where he has been an Associate Professor since 2003. In Spring 2004 he is a Visiting Associate Professor at Carnegie Mellon University.

He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB, ICDE, SODA, STOC, SPAA and other conferences, as well as Scientific American, IEEE Computer, VLDB and other journals. He holds eight US patents on Web-related inventions. He has served as vice-chair or program committee member for WWW, SIGIR, SIGKDD, VLDB, ICDE, SODA and other conferences, and as guest editor or editorial board member for the DMKD and TKDE journals. He is also the author of a new book on Web mining.

His current research interests include question answering, Web analysis, monitoring and search, mining irregular and relational data, and textual data integration.

Home page:

Real-World Learning With Markov Logic Networks


Machine learning and data mining systems have achieved many impressive successes, but to become truly widespread they must be able to work with less help from people. This requires automating the data cleaning and integration process, handling multiple types of objects and relations at once, and easily incorporating domain knowledge. In this talk, I will describe how we are pursuing these aims using Markov logic networks, a representation that combines first-order logic and probabilistic graphical models. Data from multiple sources is integrated by automatically learning mappings between the objects and terms in them. Rich relational structure is learned using a combination of ILP and statistical techniques. Knowledge is incorporated by viewing logic statements as soft constraints on the models to be learned. Application to a real-world university domain shows our approach to be accurate, efficient, and less labor-intensive than traditional ones.
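The idea of "logic statements as soft constraints" can be sketched in miniature. In a Markov logic network, each first-order formula carries a weight, and the probability of a possible world is proportional to the exponentiated sum of the weights of the formulas it satisfies. The toy domain, rules, and weights below are invented for illustration, and the brute-force enumeration only works for a handful of ground atoms; real MLN systems use far more scalable inference.

```python
from itertools import product
from math import exp

# Ground atoms of a toy domain with a single person, A.
atoms = ["Smokes(A)", "Cancer(A)"]

# Weighted rules, already grounded. Each acts as a soft constraint:
# a world satisfying a rule gains a factor exp(weight).
rules = [
    (1.5, lambda w: (not w["Smokes(A)"]) or w["Cancer(A)"]),  # Smokes => Cancer
    (0.5, lambda w: not w["Smokes(A)"]),                      # most people don't smoke
]

def weight(world):
    return exp(sum(wt for wt, sat in rules if sat(world)))

worlds = [dict(zip(atoms, vals))
          for vals in product([False, True], repeat=len(atoms))]

def prob(query, evidence=None):
    """P(query | evidence), computed by enumerating all possible worlds."""
    evidence = evidence or {}
    matching = [w for w in worlds
                if all(w[k] == v for k, v in evidence.items())]
    num = sum(weight(w) for w in matching if w[query])
    return num / sum(weight(w) for w in matching)

p = prob("Cancer(A)", {"Smokes(A)": True})   # roughly 0.82 with these weights
```

Unlike a hard logical rule, the soft constraint leaves some probability that a smoker does not get cancer; raising the weight of the first rule pushes that probability toward zero, recovering pure logic in the limit.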

(Joint work with Parag and Matt Richardson.)

Speaker: Pedro Domingos (University of Washington, Seattle)

Pedro Domingos is an assistant professor in the Department of Computer Science and Engineering at the University of Washington. His research interests are in artificial intelligence, machine learning and data mining. He received a Ph.D. in Information and Computer Science from the University of California at Irvine, and is the author or co-author of over 100 technical publications. He is an associate editor of JAIR, a member of the editorial board of the Machine Learning journal, and a co-founder of the International Machine Learning Society. He was program co-chair of KDD-2003, and has served on numerous program committees. He has received several awards, including a Sloan Fellowship, an NSF CAREER Award, a Fulbright Scholarship, an IBM Faculty Award, and best paper awards at KDD-98 and KDD-99.

Home page:

Strength in diversity: the advance of data analysis


Although the origins can be traced back as far as one likes, the proper scientific analysis of data is really only around a century old. For most of that century, data analysis was the realm of only one discipline - statistics. In recent decades, however, as a consequence of the development of the computer, things have changed dramatically and now there are several such disciplines, including machine learning, pattern recognition, and data mining. Although all of these disciplines are concerned with extracting information from data, they have subtle differences in aims and emphasis. This paper looks at some of the similarities and some of the differences, noting where the disciplines intersect and, perhaps of more interest, where they do not. Particular issues examined include the nature of the data with which they are concerned, the role of mathematics, differences in the objectives, how the different areas of application have led to different aims, and how the different disciplines have led sometimes to the same analytic tools being developed, but also sometimes to different tools being developed. Some conjectures about likely future developments are given.

Speaker: David Hand (Imperial College, London)

David Hand is Professor of Statistics and Head of the Statistics Section at Imperial College London. He has published twenty books on statistics and related areas, including Discrimination and Classification, Analysis of Repeated Measures, Practical Longitudinal Data Analysis, Construction and Assessment of Classification Rules, Intelligent Data Analysis, Statistics in Finance, and Principles of Data Mining. He is a Fellow of the Royal Statistical Society and an Honorary Fellow of the Institute of Actuaries. He launched the journal Statistics and Computing in 1991, and also served a term of office as editor of the Journal of the Royal Statistical Society, Series C. He was awarded the Thomas L. Saaty Prize for Applied Advances in the Mathematical and Management Sciences in 2001 and the Royal Statistical Society's Guy Medal in Silver in 2002, and was elected a Fellow of the British Academy, the UK's leading learned society for the humanities and social sciences, in 2003. His research interests include classification methods, the fundamentals of statistics, and data mining, and his application interests include medicine and finance. He has acted as a consultant to a wide range of organizations, including governments, banks, pharmaceutical companies, manufacturing industry, and health service providers.

Home page: