ECML/PKDD 2004, Pisa, Italy, September 20-24, 2004

Tutorials

Monday 20

  • T1 - Evaluation in Web Mining
  • T2 - Symbolic Data Analysis
  • T4 - Radial Basis Functions: An Algebraic Approach (with Data Mining Applications)
  • T5 - Mining Unstructured Data

Friday 24

  • T3 - Distributed Data Mining for Sensor Networks
  • T6 - Statistical Approaches used in Machine Learning
  • T7 - Rule-based Data Mining Methods for Classification Problems in the Biomedical Domain

T1 - Evaluation in Web Mining

September 20, 2004 - Polo Piazza Cavalieri - Aula Galilei

In conjunction with Statistical Approaches to Web Mining workshop.

Instructors

Bettina Berendt, Humboldt-University, Berlin

Ernestina Menasalvas, University of Madrid

Myra Spiliopoulou, University of Magdeburg

Abstract

Web mining has become a critical tool for competitive application intelligence. Understanding the behavior of a site's visitors requires creative extensions of KDD techniques for e-commerce and clickstream data: patterns must be discovered from a variety of data sources, and these patterns must be interpreted and transformed into actionable knowledge for redesigns that generate revenue. Redesigns encompass general improvements to information architecture and navigation options, as well as the offering of personalized recommendations and services. At the same time, a reliable discovery and interpretation of patterns cannot ignore the Web content itself. This leads to challenges in Web content mining, including text categorisation, content analysis and the extraction of implicit semantics.

These issues are already broadly recognized: research on Web mining is intensive and, in some cases, goes hand-in-hand with deployment in the market. This leads to the challenge of incorporating Web mining into the internal evaluation processes of the site operator. Web mining can be used to derive indicators that describe marketing success, the appropriateness of distribution channel mixes, or other aspects of a site's or service's success. At the same time, Web mining itself constitutes a major investment and therefore needs to be subjected to a cost-benefit evaluation. Both of these aspects, "Web mining for evaluation" and "evaluation of Web mining", require systematic methods and a context of project management. The owners of Web sites and Web applications need a complete evaluation framework in order to make well-informed decisions about the extent to which Web mining should be used as a tool for data analysis and about the deployment of its results in site and service design.

In this tutorial, we investigate the current state of Web mining evaluation from both viewpoints: evaluating a Web site and evaluating Web mining projects themselves. In particular, we address:

  • methodologies to derive interesting patterns and success indicators from the data (a minimal sketch follows this list);
  • integration of the derived patterns with the goals of the institution;
  • problems related to the development of a Web mining project and to the evaluation of its success;
  • computational and application-oriented frameworks for determining costs and benefits, and the role of different perspectives on "success";
  • project management frameworks for integrating these and other measures of application success, and for accommodating their respective strengths and weaknesses; and
  • infrastructures for deploying Web mining results.
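
As a minimal illustration of the first point, the sketch below derives one simple success indicator, the conversion rate per entry page, from a toy clickstream log. The session structure, page names, and the "purchase" event are hypothetical placeholders, not material from the tutorial.

```python
# Minimal sketch: deriving a success indicator (conversion rate per entry
# page) from a toy clickstream log. Field names and the "purchase" event
# are hypothetical placeholders.
from collections import defaultdict

# One session = ordered list of (page, action) pairs.
sessions = [
    [("/home", "view"), ("/product/42", "view"), ("/checkout", "purchase")],
    [("/home", "view"), ("/search", "view")],
    [("/landing/ad1", "view"), ("/product/42", "view")],
    [("/landing/ad1", "view"), ("/checkout", "purchase")],
]

entries = defaultdict(int)      # sessions entering at each page
conversions = defaultdict(int)  # of those, how many include a purchase

for session in sessions:
    entry_page = session[0][0]
    entries[entry_page] += 1
    if any(action == "purchase" for _, action in session):
        conversions[entry_page] += 1

for page, n in entries.items():
    rate = conversions[page] / n
    print(f"{page}: {n} sessions, conversion rate {rate:.0%}")
```

Real deployments would of course draw on far richer data sources and indicator sets, which is exactly what the methodological discussion in the tutorial addresses.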

The tutorial draws from the core domains of KDD, covering issues of data preparation, pattern discovery, and pattern analysis. We also draw on the domain of Web marketing, which contributes the requirements and the economic measures; on human-computer interaction, for user-centric success evaluation; and on project management, which addresses the gaps that must be filled in order to evaluate the impact of a Web mining project and to measure its success.


T3 - Distributed Data Mining for Sensor Networks

September 24, 2004 - Polo Piazza Cavalieri - Aula Da Vinci (b)

In conjunction with Knowledge Discovery in Data Streams workshop.

Instructor

Hillol Kargupta, University of Maryland, Baltimore County, Maryland, USA
hillol@cs.umbc.edu

Abstract

Advances in computing and communication over wired and wireless networks have resulted in many pervasive distributed computing environments. Sensor networks define one such class of distributed environments that offer interesting possibilities in many data- and labor-intensive application domains. Monitoring forest fires, tracking moving targets, and identifying unusual behaviors from vehicle data systems are some examples. Most of these sensor networks deal with distributed and constrained computing, storage, power, and communication-bandwidth resources. Mining in such environments naturally calls for proper utilization of these distributed resources. Moreover, in many ubiquitous applications privacy of the data is indeed an issue. For example, consider the problem of monitoring driving behavior in a commercial fleet using vehicle sensor networks: we must protect the privacy of the good drivers, but report the bad drivers in the fleet. A growing number of these applications deal with distributed data streams that require quick analysis and a quick response.

Most off-the-shelf data mining systems are designed to work as a monolithic centralized application, primarily on static data. They normally download the relevant data to a centralized location and then perform the data mining operations there. This centralized approach may not work well in many of the distributed, ubiquitous, and possibly privacy-sensitive data mining applications over sensor networks: data centralization may cause massive drainage of power, increase response time, and make the overall architecture poorly scalable. The field of Distributed Data Mining (DDM) offers an alternative. It pays careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near-optimal fashion (a minimal sketch of this idea follows the outline below). This tutorial will offer an introduction to the emerging field of Distributed Data Mining, specifically in the context of sensor networks. The attendees will be exposed to the following aspects of this field:

  1. An overview of DDM and sensor networks
  2. An overview of the existing DDM algorithms
  3. More detailed discussion of some important DDM algorithms that are appropriate for resource-constrained sensor network applications
  4. An overview of the systems research issues in DDM
  5. Detailed case study of an existing sensor network data management and mining system, with a hands-on demonstration
  6. Future directions
  7. Pointers to more advanced material and resources
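
To make the contrast with centralization concrete, here is a minimal sketch of one classic DDM idea (our illustration, not material from the tutorial): each node ships a constant-size sufficient statistic instead of its raw readings, and the base station assembles the global mean and variance at negligible communication cost.

```python
# Minimal sketch of a classic DDM idea (our illustration): each sensor
# node sends only a constant-size sufficient statistic -- (count, sum,
# sum of squares) -- instead of its raw readings; the base station then
# assembles global statistics from the tiny per-node messages.

def local_summary(readings):
    """Computed on the node; message size is O(1) regardless of data volume."""
    return (len(readings), sum(readings), sum(x * x for x in readings))

def merge(summaries):
    """Computed at the base station from the per-node messages."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance

# Toy example: three nodes with local temperature readings.
nodes = [[20.1, 20.4, 19.8], [22.0, 21.7], [18.9, 19.2, 19.0, 19.3]]
mean, var = merge([local_summary(r) for r in nodes])
print(f"global mean {mean:.2f}, variance {var:.3f}")
```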

T2 - Symbolic Data Analysis

September 20, 2004 - Palazzo La Sapienza - Aula Magna Storica

In conjunction with Mining Complex Data Structures workshop.

Instructors

Edwin Diday, University of Paris-IX Dauphine, France.
diday@ceremade.dauphine.fr

Carlos Marcelo, National Statistical Institute, Portugal.
Carlos.Marcelo@ine.pt

Abstract

The need to extend standard exploratory, statistical and graphical data analysis methods to more complex data, which go beyond the classical framework, is increasing: such extensions are needed to describe complex units or concepts, to obtain more accurate information, and to summarise the extensive data sets contained in huge databases. This is the case for data concerning more or less homogeneous classes or groups of individuals (second-order objects, or macro-data) instead of single individuals (first-order objects, or micro-data). The extension of classical data analysis techniques to the analysis of second-order objects is one of the main goals of a novel research field named "Symbolic Data Analysis".

Symbolic Data Analysis allows one to define concepts by a query on a database, to aggregate the initial data in order to describe these concepts (as symbolic data), and then to apply analysis methods to extract knowledge from the set of modelled concepts. Symbolic data extend the classical tabular model by allowing multiple, possibly weighted, values for each descriptive attribute, which makes it possible to represent variability and/or uncertainty present in the data. Symbolic Data Analysis methods for analysing symbolic data tables include univariate descriptive methods, visualization methods, clustering, decision trees, discrimination, regression, factorial analysis techniques and conceptual lattices.
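
As a rough illustration of this aggregation step (our construction, not the SODAS software), the sketch below turns micro-data on individuals into a symbolic description of two groups using interval-valued variables, then computes a simple univariate summary on the resulting symbolic table.

```python
# Rough illustration of symbolic data (our construction): micro-data on
# individuals (first-order objects) are aggregated into second-order
# objects described by interval-valued variables.

micro = [  # (group, age, income) -- toy first-order data
    ("town_A", 23, 1800), ("town_A", 35, 2400), ("town_A", 41, 2100),
    ("town_B", 52, 3100), ("town_B", 47, 2900),
]

# Aggregate each group into an interval [min, max] per variable.
symbolic = {}
for group, age, income in micro:
    rec = symbolic.setdefault(group, {"age": [age, age], "income": [income, income]})
    rec["age"] = [min(rec["age"][0], age), max(rec["age"][1], age)]
    rec["income"] = [min(rec["income"][0], income), max(rec["income"][1], income)]

# A simple univariate summary on the symbolic table: interval midpoints.
for group, rec in symbolic.items():
    mids = {v: (lo + hi) / 2 for v, (lo, hi) in rec.items()}
    print(group, rec, "midpoints:", mids)
```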

Symbolic data occur in many situations, for instance when summarising huge sets of data or when describing the underlying concepts of a database (a town, a socio-demographic group, a scenario of accidents). Symbolic Data Analysis also finds an important application field in official statistics: since, by law, National Statistical Institutes (NSIs) are prohibited from releasing individual responses to any other government agency or to any individual or business, data are aggregated for reasons of privacy before being distributed to external agencies and institutes. Symbolic Data Analysis provides useful tools to analyse such aggregated data.

Symbolic Data Analysis allows solving problems that arise in data analysis, in particular:

  • Large Database Treatment
  • Confidentiality
  • Missing Data
  • Metadata Modelling
  • Quality Control on Statistical Production
  • Accurate Data Interpretation
  • Use of Confidence Intervals
  • Joining of Independent Surveys
  • Exploitation of Survey Databases

Symbolic Data Analysis has advanced considerably through the European projects Symbolic Official Data Analysis System (SODAS) and Analysis System for Symbolic Official Data (ASSO). As a result of these projects, the SODAS software package has been developed.

T6 - Statistical Approaches used in Machine Learning

September 24, 2004 - Polo Piazza Cavalieri - Aula Fibonacci

Instructors

B. Apolloni, University of Milan, Milan, Italy

D. Malchiodi, University of Milan, Milan, Italy

Abstract

Machine Learning represents the new deal of statistical inference now that powerful computational tools have become available to scientists. The objects we want to infer are no longer simple parameters but entire functions. The data we process are not simple independent observations of a phenomenon; rather, they represent complex links between the different variables characterizing it. The success of the inference depends highly on the sagacity of the algorithms that process these data in relation to the inner structure we want to discover. This tutorial provides a statistical framework for perceiving, discussing and solving the key inference problems in which a large family of machine learning instances are rooted.

The paradigmatic context is a string of data (possibly of infinite length) that we partition into a prefix, which we assume to be known at present (and therefore call a sample), and a suffix of unknown future data (which we call a population). All these data share the feature of being observations of the same phenomenon, which is exactly the object of our inference. The basic inference tool is a twisting between properties we establish on the sample and random properties we wonder about on the population, such as the probability of matching a specific digit.

Moving from the elementary problem of estimating the parameter of a Bernoulli variable, we will revisit two basic inference tools: the computation of confidence intervals and the search for point estimators with nice properties. Then we will move on to learning problems: while the theoretical tools remain unchanged, the sample properties to be twisted onto the population must be wisely devised and smartly computed. For Boolean variables, we restate the bases of PAC learning theory, facing the usual related issues such as (i) the curse of dimensionality, (ii) corrupted examples, and (iii) special learning devices such as Support Vector Machines. Finally, we will touch on a few general statistical statements that can be made about neural network learning algorithms.
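
For the Bernoulli starting point, here is a minimal sketch (ours, not the instructors' material) of the classical normal-approximation (Wald) confidence interval for the success probability p, estimated from the known prefix of the data string:

```python
# Minimal sketch (ours): a normal-approximation (Wald) confidence
# interval for the parameter p of a Bernoulli variable -- the elementary
# inference problem the tutorial starts from.
import math

def bernoulli_ci(successes, n, z=1.96):
    """Approximate confidence interval for p (z=1.96 gives ~95%)."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

# Example: 37 ones observed in a sample (prefix) of 100 binary digits.
lo, hi = bernoulli_ci(37, 100)
print(f"p_hat = 0.37, 95% CI = [{lo:.3f}, {hi:.3f}]")
```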

T4 - Radial Basis Functions: An Algebraic Approach (with Data Mining Applications)

September 20, 2004 - Polo Piazza Cavalieri - Aula Fibonacci

Instructors

Amrit L. Goel, EECS Dept., Syracuse University
goel@ecs.syr.edu

Miyoung Shin, Bioinformatics Team, ETRI, Taejon, Korea
shinmy@etri.re.kr

Abstract

Radial Basis Functions (RBF) have become a very popular tool for both classification and prediction tasks. The recent flurry of research in Support Vector Machines (SVM) has provided further impetus to their growth. Yet most algorithms for their design are basically iterative and lead to irreproducible results. The authors of this tutorial have been working on an innovative approach to the design and evaluation of radial basis function models. Our algorithm is based on purely algebraic considerations, is non-iterative, and yields reproducible designs; these features are unique to the new approach. We have employed this modeling methodology for many sets of problems in software engineering, microarray data analysis, and, recently, for clustering applications in bioinformatics. The purpose of this tutorial is to present the mathematical underpinning of this approach, describe its algorithmic details, and discuss selected data mining applications (a rough sketch of the non-iterative flavor appears after the outline below). We also want to briefly discuss how it compares and contrasts with the current SVM work. A brief outline of the tutorial follows:

  • Radial Basis Function model
  • Algebraic preliminaries (singular value decomposition, QR factorization, etc.)
  • Mathematical underpinnings of the new methodology
  • Algorithmic details
  • Classification and prediction formulations
  • Data mining applications in software engineering and bioinformatics
  • Comparative assessment and current focus
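
A rough sketch of the non-iterative flavor (our illustration, not the authors' algorithm): once Gaussian centers and a width are fixed, the output weights of an RBF model follow from a single least-squares solve, here via the SVD-based pseudoinverse in NumPy.

```python
# Rough sketch of non-iterative RBF fitting (our illustration, not the
# authors' algorithm): with fixed Gaussian centers and width, the output
# weights follow from one least-squares solve via the SVD-based
# pseudoinverse.
import numpy as np

def design_matrix(x, centers, width):
    """Gaussian RBF design matrix H with H[i, j] = phi(|x_i - c_j|)."""
    d = x[:, None] - centers[None, :]
    return np.exp(-(d ** 2) / (2 * width ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

centers = np.linspace(0, 1, 8)          # fixed basis-function centers
H = design_matrix(x, centers, width=0.15)
w = np.linalg.pinv(H) @ y               # SVD-based pseudoinverse solve

y_hat = H @ w
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```

Running the same script twice produces identical weights, which is the reproducibility property the abstract emphasizes.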

T7 - Rule-based Data Mining Methods for Classification Problems in the Biomedical Domain

September 24, 2004 - Polo Piazza Cavalieri - Aula Fibonacci

Instructors

Jinyan Li, Institute for Infocomm Research, Singapore

Limsoon Wong, Institute for Infocomm Research, Singapore

Abstract

This is an introductory-to-intermediate-level tutorial of about three and a half hours. It aims to introduce the importance of rule-based data mining methods for solving biomedical classification problems. The expected audience includes final-year and post-graduate students in computer science or computational biology, as well as post-doctorates starting work on general data mining or bioinformatics topics. No special statistics, mathematics, or computer science background is required.

The tutorial has four parts. It begins with a description of a data set repository where we have stored about 20 high-dimensional biomedical data sets. Then we will talk about decision trees and committees of decision trees, and about how they are used in bioinformatics. Next, we will discuss interesting rules and patterns discovered from the data sets. Finally, we will demonstrate how to use a software package that implements a wide range of data mining and machine learning algorithms, including our own. The software is also free for academic use.
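
As a small taste of the rule-based approach (our sketch, not the instructors' software), the example below fits a shallow decision tree on a toy two-gene expression data set with scikit-learn and prints the root-to-leaf paths as readable classification rules; the feature names and labels are made up for illustration.

```python
# A flavour of rule-based classification (our sketch, not the
# instructors' software): fit a decision tree on a small labelled data
# set and print its root-to-leaf paths as human-readable rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: two hypothetical gene-expression features, binary diagnosis.
X = [[2.1, 0.3], [1.8, 0.4], [0.5, 2.2], [0.4, 1.9], [2.0, 2.1], [0.6, 0.2]]
y = ["disease", "disease", "healthy", "healthy", "disease", "healthy"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["gene_1", "gene_2"]))
```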

T5 - Mining Unstructured Data

September 20, 2004 - Polo Piazza Cavalieri - Aula Fibonacci

Instructor

Ronen Feldman, ClearForest Corporation

Abstract

The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes. Text Mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR and knowledge management. Text Mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.) and the visualization of the results.

In this tutorial we will present the general theory of Text Mining and will demonstrate several systems that use these principles to enable interactive exploration of large textual collections. We will present a general architecture for text mining and will outline the algorithms and data structures behind the systems. Special emphasis will be given to efficient algorithms for very large document collections, tools for visualizing such document collections, the use of intelligent agents to perform text mining on the Internet, and the use of information extraction to better capture the major themes of the documents. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of text mining will be presented.
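
As a small taste of the preprocessing-plus-analysis pipeline described above (our sketch, unrelated to ClearForest's systems): represent a few documents as TF-IDF vectors, then cluster that intermediate representation with k-means using scikit-learn.

```python
# A small taste of the text mining pipeline (our sketch): documents are
# turned into an intermediate representation (TF-IDF vectors), which is
# then analyzed -- here by k-means clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stocks fell as the market reacted to interest rate news",
    "the central bank raised interest rates again this quarter",
    "the team won the championship after a dramatic final match",
    "injured striker to miss the rest of the football season",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc[:50])
```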