
Tutorials
T1  Evaluation in Web Mining
September 20, 2004  Polo Piazza Cavalieri  Aula Galilei
In conjunction with the Statistical Approaches to Web Mining workshop.
Instructors
Bettina Berendt, Humboldt University, Berlin
Ernestina Menasalvas, University of Madrid
Myra Spiliopoulou, University of Magdeburg
Abstract
Web mining has become a critical tool for competitive
application intelligence. Understanding the behavior of a
site's visitors requires creative extensions of KDD
techniques for e-commerce and clickstream data: patterns
must be discovered from a variety of data sources, and
these patterns must be interpreted and transformed into
actionable knowledge for redesigns that bring
revenue. Redesigns encompass general improvements to
information architecture and navigation options, as well
as the offering of personalized recommendations and
services. At the same time, a reliable discovery and
interpretation of patterns cannot ignore the Web content
itself. This leads to challenges in Web content mining,
including text categorisation, content analysis, and the
extraction of implicit semantics.
These issues are already broadly recognized: research
on Web mining is intensive and, in some cases, goes
hand-in-hand with deployment in the market. This leads to
the challenge of incorporating Web mining into the
internal evaluation processes of the site operator. Web
mining can be used to derive indicators that describe
marketing success, the appropriateness of
distribution-channel mixes, or other measures of a site's
or service's success. At the same time, Web mining itself
constitutes a major investment and therefore needs to be
subjected to a cost-benefit evaluation. Both of these
aspects, "Web mining for evaluation" and "evaluation of
Web mining", require systematic methods and a context of
project management. The owners of Web sites and Web
applications need a complete evaluation framework in
order to make well-informed decisions about the extent to
which Web mining should be used as a tool for data
analysis and about the deployment of its results in site
and service design.
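As a toy illustration of the "Web mining for evaluation" side (our
sketch, not part of the tutorial materials), the following Python
fragment derives one such success indicator, the conversion rate per
referral channel, from hypothetical session records; all field names
and values are invented for the example.

    # Illustrative sketch only: compute a simple site-success indicator
    # (conversion rate per referral channel) from hypothetical session records.
    from collections import defaultdict

    sessions = [  # hypothetical clickstream summaries
        {"channel": "search", "converted": True},
        {"channel": "search", "converted": False},
        {"channel": "email",  "converted": True},
        {"channel": "banner", "converted": False},
    ]

    totals, hits = defaultdict(int), defaultdict(int)
    for s in sessions:
        totals[s["channel"]] += 1
        hits[s["channel"]] += s["converted"]

    for channel in totals:
        print(f"{channel}: conversion rate = {hits[channel] / totals[channel]:.2%}")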
In this tutorial, we investigate the current state of
Web mining evaluation from both viewpoints: evaluating a
Web site and evaluating Web mining projects themselves. In
particular, we address:
methodologies to derive interesting patterns and
success indicators from the data;
integration of the derived patterns with the goals
of the institution;
problems related to setting up a Web mining project
so that its success can be evaluated;
computational and application-oriented frameworks
for determining costs and benefits, and the role of
different perspectives on "success";
 project management frameworks for integrating
these and other measures of application success, and
for accommodating their respective strengths and
weaknesses; and
 infrastructures for deploying Web mining
results.
The tutorial draws from the core domains of KDD,
covering issues of data preparation, pattern discovery,
and pattern analysis. We also draw on the domain of Web
marketing, which contributes the requirements and the
economic measures; on human-computer interaction, for
user-centric success evaluation; and on project
management, which addresses the gaps that must be filled
in order to evaluate the impact of a Web mining project
and to measure its success.
T3  Distributed Data Mining for Sensor Networks
September 24, 2004  Polo Piazza Cavalieri  Aula Da Vinci (b)
In conjunction with the Knowledge Discovery in Data Streams workshop.
Instructor
Hillol Kargupta, University of Maryland, Baltimore County, Maryland, USA. hillol@cs.umbc.edu
Abstract
Advances in computing and communication over wired and
wireless networks have resulted in many pervasive
distributed computing environments. Sensor networks define
one such class of distributed environments and offer
interesting possibilities in many data- and labor-intensive
application domains: monitoring forest fires, tracking
moving targets, and identifying unusual behavior from
vehicle data systems are some examples. Most of these
sensor networks must deal with distributed and constrained
computing, storage, power, and communication-bandwidth
resources. Mining in such environments naturally calls for
the proper utilization of these distributed
resources. Moreover, in many ubiquitous applications the
privacy of the data is a genuine issue. For example,
consider the problem of monitoring driving behavior in a
commercial fleet using vehicle sensor networks: we must
protect the privacy of the good drivers while still
reporting the bad ones. A growing number of these
applications deal with distributed data streams that
require quick analysis and a quick response. Most
off-the-shelf data mining systems are designed to work as
monolithic, centralized applications operating primarily
on static data: they download the relevant data to a
centralized location and then perform the data mining
operations there. This centralized approach may not work
well in many of the distributed, ubiquitous, and possibly
privacy-sensitive data mining applications over sensor
networks; data centralization may cause massive drainage
of power, increase response time, and limit the
scalability of the overall architecture. The field of
Distributed Data Mining (DDM) offers an alternative: it
pays careful attention to the distributed resources of
data, computing, communication, and human factors in order
to use them in a near-optimal fashion. This tutorial will
offer an introduction to the emerging field of Distributed
Data Mining, specifically in the context of sensor
networks. The attendees will be exposed to the following
aspects of this field (a toy illustration of the
in-network style of computation follows the outline
below):
 An overview of DDM and sensor networks
 An overview of the existing DDM algorithms
A more detailed discussion of some important DDM algorithms
appropriate for resource-constrained sensor network applications
 An overview of the systems research issues in DDM
A detailed case study of an existing sensor network data management
and mining system, with a hands-on demonstration
 Future directions
 Pointers to more advanced material and resources
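As a minimal, hypothetical illustration of the communication savings
DDM aims for (our sketch, not taken from the tutorial materials), the
fragment below computes a global mean by shipping constant-size
partial aggregates instead of raw sensor readings; node names and
readings are invented.

    # Minimal sketch (not from the tutorial): in-network aggregation.
    # Each sensor forwards a constant-size (sum, count) pair instead of
    # its full stream of readings, so the base station recovers the
    # exact global mean with O(nodes) rather than O(readings) messages.

    node_readings = {  # hypothetical raw data held locally at each sensor
        "node_a": [21.0, 22.5, 21.7],
        "node_b": [19.8, 20.1],
        "node_c": [23.4, 22.9, 23.1, 23.0],
    }

    # Local step: each node computes a tiny partial aggregate.
    partials = [(sum(r), len(r)) for r in node_readings.values()]

    # Base station: combine the partials; no raw reading leaves a node.
    total, count = map(sum, zip(*partials))
    print(f"global mean = {total / count:.2f} from {len(partials)} messages")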
T2  Symbolic Data Analysis
September 20, 2004  Palazzo La Sapienza  Aula Magna Storica
In conjunction with the Mining Complex Data Structures workshop.
Instructors
Edwin Diday, University of Paris-IX Dauphine, France. diday@ceremade.dauphine.fr
Carlos Marcelo, National Statistical Institute, Portugal. Carlos.Marcelo@ine.pt
Abstract
The need to extend standard exploratory, statistical,
and graphical data analysis methods to more complex data
that go beyond the classical framework is increasing: such
extensions are needed to describe complex units or
concepts, to obtain more accurate information, and to
summarise the extensive data sets contained in huge
databases. This is the case for data concerning more or
less homogeneous classes or groups of individuals
(second-order objects, or macro-data) instead of single
individuals (first-order objects, or micro-data). The
extension of classical data analysis techniques to the
analysis of second-order objects is one of the main goals
of a novel research field named "Symbolic Data Analysis".
Symbolic Data Analysis makes it possible to define
concepts by a query on a database, to aggregate the
initial data in order to describe these concepts (as
symbolic data), and then to apply analysis methods to
extract knowledge from the set of modelled
concepts. Symbolic data extend the classical tabular
model by allowing multiple, possibly weighted, values for
each descriptive attribute, which makes it possible to
represent the variability and/or uncertainty present in
the data. Symbolic Data Analysis methods include
univariate descriptive methods, visualization methods,
clustering, decision trees, discrimination, regression,
factorial analysis techniques, and conceptual lattices,
all of which allow the analysis of symbolic data tables.
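To make the notion concrete (our own illustration, not taken from the
SODAS/ASSO materials), here is a minimal sketch of interval-valued
symbolic data, in which each unit is a group described by one interval
per attribute, together with a simple Hausdorff-style dissimilarity;
all names and values are invented.

    # Minimal sketch (illustrative only): interval-valued symbolic data.
    # Each "unit" is a group of individuals described by an interval
    # per attribute, rather than by a single value.

    towns = {  # hypothetical aggregated (second-order) data
        "town_a": {"age": (18, 65), "income": (20_000, 55_000)},
        "town_b": {"age": (25, 70), "income": (30_000, 80_000)},
    }

    def hausdorff(i, j):
        """Hausdorff distance between two intervals."""
        return max(abs(i[0] - j[0]), abs(i[1] - j[1]))

    # A crude dissimilarity between symbolic descriptions: sum the
    # per-attribute interval distances.
    d = sum(hausdorff(towns["town_a"][k], towns["town_b"][k])
            for k in ("age", "income"))
    print(f"dissimilarity(town_a, town_b) = {d}")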
Symbolic data occur in many situations, for instance
when summarising huge data sets or when describing the
underlying concepts of a database (a town, a
socio-demographic group, a scenario of accidents). They
also find an important field of application in official
statistics: since, by law, National Statistical Institutes
(NSIs) are prohibited from releasing individual responses
to any other government agency or to any individual or
business, data are aggregated for privacy reasons before
being distributed to external agencies and
institutes. Symbolic Data Analysis provides useful tools
to analyse such aggregated data.
Symbolic Data Analysis helps solve problems that
arise in data analysis, in particular:
 Large Database Treatment
 Confidentiality
 Missing Data
 Metadata Modelling
 Quality Control on Statistical Production
 Accurate Data Interpretation
 Use of Confidence Intervals
 Joining of Independent Surveys
 Exploitation of Survey Databases
Symbolic Data Analysis underwent great development
within the European projects Symbolic Official Data
Analysis System (SODAS) and Analysis System for Symbolic
Official Data (ASSO). As a result of these projects, the
SODAS software package has been developed.
T6  Statistical Approaches used in Machine Learning
September 24, 2004  Polo Piazza Cavalieri  Aula Fibonacci
Instructors
B. Apolloni, University of Milan, Milan, Italy
D. Malchiodi, University of Milan, Milan, Italy
Abstract
Machine Learning represents the new deal of statistical
inference now that powerful computational tools have
become available to scientists. The objects we want to
infer are no longer simple parameters but entire
functions. The data we process are not simple independent
observations of a phenomenon; rather, they represent
complex links between the different variables
characterizing it. The success of the inference depends
greatly on the sagacity with which the algorithms process
these data in relation to the inner structure we want to
discover. This tutorial provides a statistical framework
for perceiving, discussing, and solving the key inference
problems in which a large family of machine learning
instances is rooted.
The paradigmatic context is a string of data (possibly
of infinite length) that we partition into a prefix we
assume to be known at present (and therefore call a
sample) and a suffix of unknown future data (which we call
a population). All these data share the feature of being
observations of the same phenomenon, which is exactly the
object of our inference. The basic inference tool is a
twisting between properties we establish on the sample and
random properties we wonder about on the population, such
as the probability of matching a specific digit.
Moving from the elementary problem of estimating the
parameter of a Bernoulli variable, we will revisit two
basic inference tools: the computation of confidence
intervals and the search for point estimators with nice
properties. Then we will move on to learning problems:
while the theoretical tools remain unchanged, the sample
properties to be twisted on the population must be wisely
devised and smartly computed. For Boolean variables, we
restate the bases of PAC learning theory, facing the usual
related issues such as (i) the curse of dimensionality,
(ii) corrupted examples, and (iii) special learning
devices such as Support Vector Machines. Finally, we will
touch on a few general statistical statements that can be
made about neural network learning algorithms.
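As a concrete companion to the opening example (our sketch, using the
standard exact beta-quantile construction rather than the tutorial's
own twisting argument), the following Python fragment computes a
Clopper-Pearson confidence interval for the parameter of a Bernoulli
variable.

    # Sketch: exact (Clopper-Pearson) confidence interval for the
    # parameter p of a Bernoulli variable, given k successes in n trials.
    # This uses the standard beta-quantile construction, not the
    # tutorial's twisting argument.
    from scipy.stats import beta

    def clopper_pearson(k, n, alpha=0.05):
        """Return a 1 - alpha confidence interval for p."""
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    print(clopper_pearson(k=7, n=20))  # roughly (0.15, 0.59)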
T4  Radial Basis Functions: An Algebraic Approach (with Data Mining Applications)
September 20, 2004  Polo Piazza Cavalieri  Aula Fibonacci
Instructors
Amrit L. Goel, EECS Dept., Syracuse University. goel@ecs.syr.edu
Miyoung Shin, Bioinformatics Team, ETRI, Taejon, Korea. shinmy@etri.re.kr
Abstract
Radial Basis Functions (RBF) have become a very popular
tool for both classification and prediction tasks, and the
recent flurry of research in Support Vector Machines (SVM)
has provided further impetus to their growth. Yet most
algorithms for their design are basically iterative and
lead to irreproducible results. The authors of this
tutorial have been working on an innovative approach to
the design and evaluation of radial basis function
models. Our algorithm is based on purely algebraic
considerations, is non-iterative, and yields reproducible
designs; these features are unique to our new approach. We
have employed this modeling methodology on many problems
in software engineering, microarray data analysis, and,
recently, clustering applications in bioinformatics. The
purpose of this tutorial is to present the mathematical
underpinning of this approach, describe its algorithmic
details, and discuss selected data mining
applications. We also want to briefly discuss how it
compares and contrasts with current SVM work. A brief
outline of the proposed tutorial follows; a generic
least-squares sketch of RBF fitting appears after the
outline.
 Radial Basis Function model
Algebraic preliminaries (singular value decomposition, QR factorization, etc.)
 Mathematical underpinnings of the new methodology
 Algorithmic details
 Classification and prediction formulations
Data mining applications in software engineering and bioinformatics
 Comparative assessment and current focus.
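The sketch below is a generic illustration of the algebraic viewpoint,
not the authors' algorithm: with the centers and width fixed (our
assumption; choosing them non-iteratively is what the tutorial's
methodology addresses), the RBF output weights follow from a single
SVD-based least-squares solve. All names and values are invented.

    # Generic sketch (not the authors' algorithm): fitting RBF output
    # weights by a purely algebraic least-squares step.
    import numpy as np

    def gaussian_design(X, centers, width):
        """Design matrix Phi[i, j] = exp(-||x_i - c_j||^2 / (2 width^2))."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * width ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(50, 1))        # toy inputs
    y = np.sin(3 * X[:, 0])                     # toy targets
    centers = np.linspace(-1, 1, 8)[:, None]    # fixed centers (assumption)

    Phi = gaussian_design(X, centers, width=0.3)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None) # SVD-based least squares
    print("training RMSE:", np.sqrt(np.mean((Phi @ w - y) ** 2)))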
T7  Rule-based Data Mining Methods for Classification Problems in the Biomedical Domain
September 24, 2004  Polo Piazza Cavalieri  Aula Fibonacci
Instructors
Jinyan Li, Institute for Infocomm Research, Singapore
Limsoon Wong, Institute for Infocomm Research, Singapore
Abstract
This is an introductory to intermediate level tutorial
that takes about three and a half hours. It aims to
introduce the importance of rule-based data mining
methods for solving biomedical classification problems.
We expect the audience to include final-year and
postgraduate students in computer science or computational
biology, as well as postdoctoral researchers starting work
on general data mining or bioinformatics topics. No
special statistics, mathematics, or computer science
background is required.
The tutorial has four parts. It begins with a
description of a data set repository where we have stored
about 20 high-dimensional biomedical data sets. We then
discuss decision trees and committees of decision trees,
and how they are used in bioinformatics. Next, we present
interesting rules and patterns discovered from the data
sets. Finally, we demonstrate how to use a software
package that implements a wide range of data mining and
machine learning algorithms, including our own; this
software is also free for academic use.
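As a self-contained illustration of the kind of method covered (our
sketch; the tutorial demonstrates its own software package), the
fragment below fits a decision tree to synthetic high-dimensional data
and prints its rules; every root-to-leaf path reads as an if-then rule.

    # Illustrative sketch only (not the tutorial's software): fit a
    # decision tree on toy "expression-like" data and print its rules.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical high-dimensional, small-sample data, as is typical
    # of microarray classification problems.
    X, y = make_classification(n_samples=60, n_features=100,
                               n_informative=5, random_state=0)

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree))  # each path reads as an if-then rule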
T5  Mining Unstructured Data
September 20, 2004  Polo Piazza Cavalieri  Aula Fibonacci
Instructor
Ronen Feldman, ClearForest Corporation
Abstract
The information age has made it easy to store large
amounts of data. The proliferation of documents available
on the Web, on corporate intranets, on news wires, and
elsewhere is overwhelming. However, while the amount of
data available to us is constantly increasing, our ability
to absorb and process this information remains
constant. Search engines only exacerbate the problem by
making more and more documents available in a matter of a
few keystrokes. Text Mining is a new and exciting
research area that tries to solve the information overload
problem by using techniques from data mining, machine
learning, natural language processing (NLP), information
retrieval (IR), and knowledge management. Text Mining
involves the preprocessing of document collections (text
categorization, information extraction, term extraction),
the storage of the intermediate representations, the
techniques to analyze these intermediate representations
(distribution analysis, clustering, trend analysis,
association rules, etc.), and the visualization of the
results. In this tutorial we will present the general
theory of Text Mining and will demonstrate several systems
that use these principles to enable interactive
exploration of large textual collections. We will present
a general architecture for text mining and will outline
the algorithms and data structures behind the
systems. Special emphasis will be given to efficient
algorithms for very large document collections, tools for
visualizing such document collections, the use of
intelligent agents to perform text mining on the Internet,
and the use of information extraction to better capture
the major themes of the documents. The tutorial will cover
the state of the art in this rapidly growing area of
research, and several real-world applications of text
mining will be presented.

