<<PREV
4.5 Outlier Analysis
A data set may contain objects that do not comply with the general behavior
or model of the data. These data objects are outliers. Many data mining methods
discard outliers as noise or exceptions. However, in some applications (e.g.,
fraud detection) the rare events can be more interesting than the more regularly
occurring ones. The analysis of outlier data is referred to as outlier analysis
or anomaly mining.
Outliers may be detected using statistical tests that assume a distribution
or probability model for the data, or using distance measures where objects
that are remote from any other cluster are considered outliers. Rather than
using statistical or distance measures, density-based methods may identify
outliers in a local region, although they look normal from a global statistical
distribution view.
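The approaches above can be sketched in a few lines of Python. This is illustrative only; the function names and the toy charge data are hypothetical, not from the text:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Statistical approach: flag values far from the mean, measured in
    standard deviations (assumes the data are roughly normal)."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, radius, min_neighbors=2):
    """Distance-based approach: flag values with fewer than
    `min_neighbors` other values within `radius`."""
    result = []
    for i, v in enumerate(values):
        neighbors = sum(1 for j, w in enumerate(values)
                        if j != i and abs(v - w) <= radius)
        if neighbors < min_neighbors:
            result.append(v)
    return result

# Hypothetical credit card charges with one unusually large purchase
charges = [25, 30, 28, 35, 22, 27, 31, 2900]
print(zscore_outliers(charges))               # -> [2900]
print(distance_outliers(charges, radius=10))  # -> [2900]
```

Note that one extreme value inflates the standard deviation itself, which is why the z-score threshold here is a lenient 2.0; the distance-based test does not suffer from this masking effect.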
Example 10: Outlier analysis. Outlier analysis may uncover fraudulent usage
of credit cards by detecting purchases of unusually large amounts for a given
account number in comparison to regular charges incurred by the same account.
Outlier values may also be detected with respect to the locations and types
of purchase, or the purchase frequency.
Outlier analysis is discussed in Section 12.
4.6 Are All Patterns Interesting?
A data mining system has the potential to generate thousands or even millions
of patterns, or rules.
You may ask, "Are all of the patterns interesting?" Typically, the
answer is no; only a small fraction of the patterns potentially generated would
actually be of interest to a given user.
This raises some serious questions for data mining. You may wonder, "What
makes a pattern interesting? Can a data mining system generate all of the interesting
patterns? Or, can the system generate only the interesting ones?" To answer
the first question, a pattern is interesting if it is (1) easily understood
by humans, (2) valid on new or test data with some degree of certainty, (3)
potentially useful, and (4) novel. A pattern is also interesting if it validates
a hypothesis that the user sought to confirm. An interesting pattern represents
knowledge.
Several objective measures of pattern interestingness exist. These are based
on the structure of discovered patterns and the statistics underlying them.
An objective measure for association rules of the form X ⇒ Y is rule support,
representing the percentage of transactions from a transaction database that
the given rule satisfies. This is taken to be the probability P(X ∪ Y), where
X ∪ Y indicates that a transaction contains both X and Y, that is, the union
of item sets X and Y. Another objective measure for association rules is confidence,
which assesses the degree of certainty of the detected association. This is
taken to be the conditional probability P(Y|X), that is, the probability that
a transaction containing X also contains Y. More formally, support and confidence
are defined as
support(X ⇒ Y) = P(X ∪ Y),
confidence(X ⇒ Y) = P(Y|X).
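These two definitions translate directly into code. The following Python sketch (the function names and toy baskets are our own, for illustration) estimates both probabilities by counting over a small transaction set:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`
    (an estimate of P(X ∪ Y) when `itemset` is X ∪ Y)."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, x, y):
    """Conditional probability P(Y | X) = support(X ∪ Y) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

baskets = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]
print(support(baskets, {"computer", "software"}))       # -> 0.5
print(confidence(baskets, {"computer"}, {"software"}))  # -> 0.666...
```

Here the rule computer ⇒ software has 50% support (two of four baskets contain both items) and about 67% confidence (two of the three baskets containing "computer" also contain "software").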
In general, each interestingness measure is associated with a threshold, which
may be controlled by the user. For example, rules that do not satisfy a confidence
threshold of, say, 50% can be considered uninteresting. Rules below the threshold
likely reflect noise, exceptions, or minority cases and are probably of less
value.
Other objective interestingness measures include accuracy and coverage for
classification (IF-THEN) rules. In general terms, accuracy tells us the percentage
of data that are correctly classified by a rule. Coverage is similar to support,
in that it tells us the percentage of data to which a rule applies. Regarding
understandability, we may use simple objective measures that assess the complexity
or length in bits of the patterns mined.
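Accuracy and coverage for an IF-THEN rule can be computed the same way. The sketch below is illustrative (hypothetical records and attribute names) for a rule such as IF age < 30 THEN buys = "yes":

```python
def rule_coverage_accuracy(data, condition, predicted_class, class_key):
    """Coverage: fraction of records the rule's IF-part applies to.
    Accuracy: among covered records, fraction whose actual class
    matches the rule's THEN-part."""
    covered = [r for r in data if condition(r)]
    coverage = len(covered) / len(data)
    if not covered:
        return coverage, 0.0
    accuracy = sum(1 for r in covered
                   if r[class_key] == predicted_class) / len(covered)
    return coverage, accuracy

# Hypothetical records for the rule: IF age < 30 THEN buys = "yes"
records = [
    {"age": 25, "buys": "yes"},
    {"age": 28, "buys": "yes"},
    {"age": 29, "buys": "no"},
    {"age": 45, "buys": "no"},
]
cov, acc = rule_coverage_accuracy(records, lambda r: r["age"] < 30, "yes", "buys")
print(cov, acc)  # coverage 0.75; accuracy 2/3
```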
Although objective measures help identify interesting patterns, they are often
insufficient unless combined with subjective measures that reflect a particular
user's needs and interests. For example, patterns describing the characteristics
of customers who shop frequently at AllElectronics should be interesting to
the marketing manager, but may be of little interest to other analysts studying
the same database for patterns on employee performance. Furthermore, many patterns
that are interesting by objective standards may represent common sense and,
therefore, are actually uninteresting.
Subjective interestingness measures are based on user beliefs in the data.
These measures find patterns interesting if the patterns are unexpected (contradicting
a user's belief) or offer strategic information on which the user can act.
In the latter case, such patterns are referred to as actionable. For example,
patterns like "a large earthquake often follows a cluster of small quakes" may
be highly actionable if users can act on the information to save lives. Patterns
that are expected can be interesting if they confirm a hypothesis that the
user wishes to validate or they resemble a user's hunch.
The second question--"Can a data mining system generate all of the interesting
patterns?"--refers to the completeness of a data mining algorithm. It
is often unrealistic and inefficient for data mining systems to generate all
possible patterns. Instead, user-provided constraints and interestingness measures
should be used to focus the search.
For some mining tasks, such as association, this is often sufficient to ensure
the completeness of the algorithm. Association rule mining is an example where
the use of constraints and interestingness measures can ensure the completeness
of mining. The methods involved are examined in detail in Section 6.
Finally, the third question--"Can a data mining system generate only
interesting patterns?"--is an optimization problem in data mining. It
is highly desirable for data mining systems to generate only interesting patterns.
This would be efficient for users and data-mining systems because neither would
have to search through the patterns generated to identify the truly interesting
ones. Progress has been made in this direction; however, such optimization
remains a challenging issue in data mining.
Measures of pattern interestingness are essential for the efficient discovery
of patterns by target users. Such measures can be used after the data mining
step to rank the discovered patterns according to their interestingness, filtering
out the uninteresting ones.
More important, such measures can be used to guide and constrain the discovery
process, improving the search efficiency by pruning away subsets of the pattern
space that do not satisfy prespecified interestingness constraints. Examples
of such a constraint-based mining process are described in Section 7 (with
respect to pattern discovery) and Section 11 (with respect to clustering).
Methods to assess pattern interestingness, and their use to improve data mining
efficiency, are discussed throughout the guide with respect to each kind of
pattern that can be mined.
5. Which Technologies Are Used?
As a highly application-driven domain, data mining has incorporated many techniques
from other domains such as statistics, machine learning, pattern recognition,
database and data warehouse systems, information retrieval, visualization,
algorithms, high-performance computing, and many application domains (FIG. 11). The interdisciplinary nature of data mining research and development
contributes significantly to the success of data mining and its extensive applications.
In this section, we give examples of several disciplines that strongly influence
the development of data mining methods.

FIG. 11 Data mining adopts techniques from many domains.
5.1 Statistics
Statistics studies the collection, analysis, interpretation or explanation,
and presentation of data. Data mining has an inherent connection with statistics.
A statistical model is a set of mathematical functions that describe the behavior
of the objects in a target class in terms of random variables and their associated
probability distributions. Statistical models are widely used to model data
and data classes.
For example, in data mining tasks like data characterization and classification,
statistical models of target classes can be built. In other words, such statistical
models can be the outcome of a data mining task. Alternatively, data mining
tasks can be built on top of statistical models. For example, we can use statistics
to model noise and missing data values. Then, when mining patterns in a large
data set, the data mining process can use the model to help identify and handle
noisy or missing values in the data.
Statistics research develops tools for prediction and forecasting using data
and statistical models. Statistical methods can be used to summarize or describe
a collection of data. Basic statistical descriptions of data are introduced
in Section 2. Statistics is useful for mining various patterns from data as
well as for understanding the underlying mechanisms generating and affecting
the patterns. Inferential statistics (or predictive statistics) models data
in a way that accounts for randomness and uncertainty in the observations and
is used to draw inferences about the process or population under investigation.
Statistical methods can also be used to verify data mining results. For example,
after a classification or prediction model is mined, the model should be verified
by statistical hypothesis testing. A statistical hypothesis test (sometimes
called confirmatory data analysis) makes statistical decisions using experimental
data. A result is called statistically significant if it is unlikely to have
occurred by chance. If the classification or prediction model holds true, then
the descriptive statistics of the model increase our confidence in its soundness.
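The notion of statistical significance can be illustrated with a small sketch. The permutation test below (pure Python; the function name and the two groups of measurements are hypothetical, not from the text) estimates how often a difference in group means at least as large as the observed one would arise by chance alone:

```python
import random

def permutation_test(a, b, trials=10_000, seed=42):
    """Estimate the p-value of the observed difference in group means:
    the chance a difference at least this large arises when group
    labels are assigned randomly (the null hypothesis)."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical measurements for two mined customer segments
group_a = [12, 14, 11, 13, 15, 12]
group_b = [18, 21, 19, 20, 22, 18]
p = permutation_test(group_a, group_b)
print(p < 0.05)  # a small p-value marks the difference as statistically significant
```

A permutation test makes no distributional assumptions, which is why it is a convenient stand-in here for the parametric tests a statistics text would develop in full.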
Applying statistical methods in data mining is far from trivial. Often, a
serious challenge is how to scale up a statistical method over a large data
set. Many statistical methods have high complexity in computation. When such
methods are applied on large data sets that are also distributed on multiple
logical or physical sites, algorithms should be carefully designed and tuned
to reduce the computational cost. This challenge becomes even tougher for online
applications, such as online query suggestions in search engines, where data
mining is required to continuously handle fast, real-time data streams.
5.2 Machine Learning
Machine learning investigates how computers can learn (or improve their performance)
based on data. A main research area is for computer programs to automatically
learn to recognize complex patterns and make intelligent decisions based on
data. For example, a typical machine learning problem is to program a computer
so that it can automatically recognize handwritten postal codes on mail after
learning from a set of examples.
Machine learning is a fast-growing discipline. Here, we illustrate classic
problems in machine learning that are highly related to data mining.
Supervised learning is basically a synonym for classification. The supervision
in the learning comes from the labeled examples in the training data set. For
example, in the postal code recognition problem, a set of handwritten postal
code images and their corresponding machine-readable translations are used
as the training examples, which supervise the learning of the classification
model.
Unsupervised learning is essentially a synonym for clustering. The learning
process is unsupervised since the input examples are not class labeled. Typically,
we may use clustering to discover classes within the data. For example, an
unsupervised learning method can take, as input, a set of images of handwritten
digits. Suppose that it finds 10 clusters of data. These clusters may correspond
to the 10 distinct digits of 0 to 9, respectively. However, since the training
data are not labeled, the learned model cannot tell us the semantic meaning
of the clusters found.
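A minimal clustering sketch makes this concrete. The naive k-means implementation below (pure Python, 2-D points instead of digit images, with a deliberately simple "first k points" initialization) discovers two groups but, as noted, cannot say what the groups mean:

```python
import math

def kmeans(points, k, iters=10):
    """Naive k-means sketch: initialize centroids from the first k points,
    assign each point to its nearest centroid, recompute centroids, repeat."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Six unlabeled 2-D points forming two blobs; the method finds the
# groups but cannot tell us what the groups mean
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

Production k-means implementations use smarter initialization (e.g., k-means++) and a convergence check rather than a fixed iteration count.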
Semi-supervised learning is a class of machine learning techniques that make
use of both labeled and unlabeled examples when learning a model. In one approach,
labeled examples are used to learn class models and unlabeled examples are
used to refine the boundaries between classes. For a two-class problem, we
can think of the set of examples belonging to one class as the positive examples
and those belonging to the other class as the negative examples. In FIG. 12, if we do not consider the unlabeled examples, the dashed line is the
decision boundary that best partitions the positive examples from the negative
examples. Using the unlabeled examples, we can refine the decision boundary
to the solid line. Moreover, we can detect that the two positive examples at
the top right corner, though labeled, are likely noise or outliers.
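One simple flavor of this idea is self-training. The sketch below (hypothetical 1-D data; a centroid-based classifier chosen purely for brevity) learns class centroids from the labeled points, pseudo-labels the unlabeled points, and lets them shift the decision boundary:

```python
def self_train(labeled, unlabeled, rounds=3):
    """Self-training sketch for a two-class, 1-D problem: class centroids
    learned from labeled points define a decision boundary; unlabeled
    points receive pseudo-labels, which in turn refine the centroids."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    for _ in range(rounds):
        c_pos = sum(pos) / len(pos)
        c_neg = sum(neg) / len(neg)
        # restart from the truly labeled points, then pseudo-label the rest
        pos = [x for x, y in labeled if y == 1]
        neg = [x for x, y in labeled if y == 0]
        for x in unlabeled:
            (pos if abs(x - c_pos) < abs(x - c_neg) else neg).append(x)
    return (c_pos + c_neg) / 2  # midpoint between centroids = decision boundary

labeled = [(0.0, 0), (2.0, 0), (8.0, 1), (10.0, 1)]  # labels alone give boundary 5.0
unlabeled = [3.0, 3.5, 4.0, 6.5, 7.0]
print(self_train(labeled, unlabeled))  # -> 5.1875 (boundary shifted by unlabeled data)
```

The unlabeled points pull the boundary away from the midpoint of the labeled data, which is the 1-D analogue of the dashed-to-solid boundary refinement in FIG. 12.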
Active learning is a machine learning approach that lets users play an active
role in the learning process. An active learning approach can ask a user (e.g.,
a domain expert) to label an example, which may be from a set of unlabeled
examples or synthesized by the learning program. The goal is to optimize the
model quality by actively acquiring knowledge from human users, given a constraint
on how many examples they can be asked to label.

FIG. 12 Semi-supervised learning.
You can see there are many similarities between data mining and machine learning.
For classification and clustering tasks, machine learning research often focuses
on the accuracy of the model. In addition to accuracy, data mining research
places strong emphasis on the efficiency and scalability of mining methods
on large data sets, as well as on ways to handle complex types of data and
explore new, alternative methods.
5.3 Database Systems and Data Warehouses
Database systems research focuses on the creation, maintenance, and use of
databases for organizations and end-users. Particularly, database systems researchers
have established highly recognized principles in data models, query languages,
query processing and optimization methods, data storage, and indexing and accessing
methods. Database systems are often well known for their high scalability in
processing very large, relatively structured data sets.
Many data mining tasks need to handle large data sets or even real-time, fast
streaming data. Therefore, data mining can make good use of scalable database
technologies to achieve high efficiency and scalability on large data sets.
Moreover, data-mining tasks can be used to extend the capability of existing
database systems to satisfy advanced users' sophisticated data analysis requirements.
Recent database systems have built systematic data analysis capabilities on
database data using data warehousing and data mining facilities. A data warehouse
integrates data originating from multiple sources and various timeframes. It
consolidates data in multidimensional space to form partially materialized
data cubes. The data cube model not only facilitates OLAP in multidimensional
databases but also promotes multidimensional data mining.
5.4 Information Retrieval
Information retrieval (IR) is the science of searching for documents or information
in documents. Documents can be text or multimedia, and may reside on the Web.
The differences between traditional information retrieval and database systems
are twofold:
Information retrieval assumes that (1) the data under search are unstructured;
and (2) the queries are formed mainly by keywords, which do not have complex
structures (unlike SQL queries in database systems).
The typical approaches in information retrieval adopt probabilistic models.
For example, a text document can be regarded as a bag of words, that is, a
multiset of words appearing in the document. The document's language model
is the probability density function that generates the bag of words in the
document. The similarity between two documents can be measured by the similarity
between their corresponding language models.
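A simpler, closely related version of this idea compares the bag-of-words vectors directly. The following Python sketch (illustrative; toy documents of our own) measures document similarity as the cosine of the angle between word-count vectors:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Bag-of-words similarity: cosine of the angle between the
    word-count vectors of the two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

s = cosine_similarity("data mining finds patterns in data",
                      "mining patterns from data")
print(round(s, 3))  # -> 0.707
```

Language-model approaches replace the raw counts with smoothed word probabilities, but the comparison machinery is the same.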
Furthermore, a topic in a set of text documents can be modeled as a probability
distribution over the vocabulary, which is called a topic model. A text document,
which may involve one or multiple topics, can be regarded as a mixture of multiple
topic models. By integrating information retrieval models and data mining techniques,
we can find the major topics in a collection of documents and, for each document
in the collection, the major topics involved.
Increasingly large amounts of text and multimedia data have been accumulated
and made available online due to the fast growth of the Web and applications
such as digital libraries, digital governments, and health care information
systems. Their effective search and analysis have raised many challenging issues
in data mining. Therefore, text mining and multimedia data mining, integrated
with information retrieval methods, have become increasingly important.
6. Which Kinds of Applications Are Targeted?
Where there are data, there are data mining applications. As a highly application-driven
discipline, data mining has seen great successes in many applications. It is
impossible to enumerate all applications where data mining plays a critical
role. Presentations of data mining in knowledge-intensive application domains,
such as bioinformatics and software engineering, require more in-depth treatment
and are beyond the scope of this guide. To demonstrate the importance of applications
as a major dimension in data mining research and development, we briefly discuss
two highly successful and popular application examples of data mining: business
intelligence and search engines.
6.1 Business Intelligence
It is critical for businesses to acquire a better understanding of the commercial
context of their organization, such as their customers, the market, supply
and resources, and competitors. Business intelligence (BI) technologies provide
historical, current, and predictive views of business operations. Examples
include reporting, online analytical processing, business performance management,
competitive intelligence, benchmarking, and predictive analytics.
"How important is business intelligence?" Without data mining, many
businesses may not be able to perform effective market analysis, compare customer
feedback on similar products, discover the strengths and weaknesses of their
competitors, retain highly valuable customers, and make smart business decisions.
Clearly, data mining is the core of business intelligence. Online analytical
processing tools in business intelligence rely on data warehousing and multidimensional
data mining. Classification and prediction techniques are the core of predictive
analytics in business intelligence, for which there are many applications in
analyzing markets, supplies, and sales. Moreover, clustering plays a central
role in customer relationship management, which groups customers based on their
similarities. Using characterization mining techniques, we can better understand
features of each customer group and develop customized customer reward programs.
6.2 Web Search Engines
A Web search engine is a specialized computer server that searches for information
on the Web. The search results of a user query are often returned as a list
(sometimes called hits). The hits may consist of web pages, images, and other
types of files. Some search engines also search and return data available in
public databases or open directories. Search engines differ from web directories
in that web directories are maintained by human editors whereas search engines
operate algorithmically or by a mixture of algorithmic and human input.
Web search engines are essentially very large data mining applications. Various
data mining techniques are used in all aspects of search engines, ranging from
crawling (e.g., deciding which pages should be crawled and the crawling frequencies),
indexing (e.g., selecting pages to be indexed and deciding to which extent
the index should be constructed), and searching (e.g., deciding how pages should
be ranked, which advertisements should be added, and how the search results
can be personalized or made "context aware").
Search engines pose grand challenges to data mining. First, they have to handle
a huge and ever-growing amount of data. Typically, such data cannot be processed
using one or a few machines. Instead, search engines often need to use computer
clouds, which consist of thousands or even hundreds of thousands of computers
that collaboratively mine the huge amount of data. Scaling up data mining methods
over computer clouds and large distributed data sets is an area for further
research.
Second, Web search engines often have to deal with online data. A search engine
may be able to afford constructing a model offline on huge data sets. For example,
it may construct a query classifier that assigns a search query to predefined
categories based on the query topic (i.e., whether the search query "apple" is
meant to retrieve information about a fruit or a brand of computers). Even when
a model is constructed offline, the application of the model online must be
fast enough to answer user queries in real time.
Another challenge is maintaining and incrementally updating a model on fast
growing data streams. For example, a query classifier may need to be incrementally
maintained continuously since new queries keep emerging and predefined categories
and the data distribution may change. Most of the existing model training methods
are offline and static and thus cannot be used in such a scenario.
Third, Web search engines often have to deal with queries that are asked only
a very small number of times. Suppose a search engine wants to provide context-aware
query recommendations. That is, when a user poses a query, the search engine
tries to infer the context of the query using the user's profile and his query
history in order to return more customized answers within a small fraction
of a second. However, although the total number of queries asked can be huge,
most of the queries may be asked only once or a few times. Such severely skewed
data are challenging for many data mining and machine learning methods.
[A Web crawler is a computer program that browses the Web in a methodical,
automated manner.]
7. Major Issues in Data Mining
Life is short but art is long. - Hippocrates
Data mining is a dynamic and fast-expanding field with great strengths. In
this section, we briefly outline the major issues in data mining research,
partitioning them into five groups: mining methodology, user interaction, efficiency
and scalability, diversity of data types, and data mining and society. Many
of these issues have been addressed in recent data mining research and development
to a certain extent and are now considered data mining requirements; others
are still at the research stage. The issues continue to stimulate further investigation
and improvement in data mining.
7.1 Mining Methodology
Researchers have been vigorously developing new data mining methodologies.
This involves the investigation of new kinds of knowledge, mining in multidimensional
space, integrating methods from other disciplines, and the consideration of
semantic ties among data objects. In addition, mining methodologies should
consider issues such as data uncertainty, noise, and incompleteness. Some mining
methods explore how user-specified measures can be used to assess the interestingness
of discovered patterns as well as guide the discovery process. Let's have a
look at these various aspects of mining methodology.
Mining various and new kinds of knowledge: Data mining covers a wide spectrum
of data analysis and knowledge discovery tasks, from data characterization
and discrimination to association and correlation analysis, classification,
regression, clustering, outlier analysis, sequence analysis, and trend and
evolution analysis. These tasks may use the same database in different ways
and require the development of numerous data mining techniques. Due to the
diversity of applications, new mining tasks continue to emerge, making data
mining a dynamic and fast-growing field. For example, for effective knowledge
discovery in information networks, integrated clustering and ranking may lead
to the discovery of high-quality clusters and object ranks in large networks.
Mining knowledge in multidimensional space: When searching for knowledge in
large data sets, we can explore the data in multidimensional space. That is,
we can search for interesting patterns among combinations of dimensions (attributes)
at varying levels of abstraction. Such mining is known as (exploratory) multidimensional
data mining. In many cases, data can be aggregated or viewed as a multidimensional
data cube. Mining knowledge in cube space can substantially enhance the power
and flexibility of data mining.
Data mining-an interdisciplinary effort: The power of data mining can be substantially
enhanced by integrating new methods from multiple disciplines. For example,
to mine data with natural language text, it makes sense to fuse data mining
methods with methods of information retrieval and natural language processing.
As another example, consider the mining of software bugs in large programs.
This form of mining, known as bug mining, benefits from the incorporation of
software engineering knowledge into the data mining process.
Boosting the power of discovery in a networked environment: Most data objects
reside in a linked or interconnected environment, whether it be the Web, database
relations, files, or documents. Semantic links across multiple data objects
can be used to advantage in data mining. Knowledge derived in one set of objects
can be used to boost the discovery of knowledge in a "related" or
semantically linked set of objects.
Handling uncertainty, noise, or incompleteness of data: Data often contain
noise, errors, exceptions, or uncertainty, or are incomplete. Errors and noise
may confuse the data mining process, leading to the derivation of erroneous
patterns. Data cleaning, data preprocessing, outlier detection and removal,
and uncertainty reasoning are examples of techniques that need to be integrated
with the data mining process.
Pattern evaluation and pattern- or constraint-guided mining: Not all the patterns
generated by data mining processes are interesting. What makes a pattern interesting
may vary from user to user. Therefore, techniques are needed to assess the
interestingness of discovered patterns based on subjective measures. These
estimate the value of patterns with respect to a given user class, based on
user beliefs or expectations. Moreover, by using interestingness measures or
user-specified constraints to guide the discovery process, we may generate
more interesting patterns and reduce the search space.
7.2 User Interaction
The user plays an important role in the data mining process. Interesting areas
of research include how to interact with a data mining system, how to incorporate
a user's background knowledge in mining, and how to visualize and comprehend
data mining results.
We introduce each of these here.
Interactive mining: The data mining process should be highly interactive.
Thus, it is important to build flexible user interfaces and an exploratory
mining environment, facilitating the user's interaction with the system. A
user may like to first sample a set of data, explore general characteristics
of the data, and estimate potential mining results. Interactive mining should
allow users to dynamically change the focus of a search, to refine mining requests
based on returned results, and to drill, dice, and pivot through the data and
knowledge space interactively, dynamically exploring "cube space" while
mining.
Incorporation of background knowledge: Background knowledge, constraints,
rules, and other information regarding the domain under study should be incorporated
into the knowledge discovery process. Such knowledge can be used for pattern
evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages: Query languages (e.g.,
SQL) have played an important role in flexible searching because they allow
users to pose ad hoc queries. Similarly, high-level data mining query languages
or other high-level flexible user interfaces will give users the freedom to
define ad hoc data mining tasks.
This should facilitate specification of the relevant sets of data for analysis,
the domain knowledge, the kinds of knowledge to be mined, and the conditions
and constraints to be enforced on the discovered patterns. Optimization of
the processing of such flexible mining requests is another promising area of
study.
Presentation and visualization of data mining results: How can a data mining
system present data mining results, vividly and flexibly, so that the discovered
knowledge can be easily understood and directly usable by humans? This is especially
crucial if the data mining process is interactive. It requires the system to
adopt expressive knowledge representations, user-friendly interfaces, and visualization
techniques.
7.3 Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining
algorithms. As data amounts continue to multiply, these two factors are especially
critical.
Efficiency and scalability of data mining algorithms: Data mining algorithms
must be efficient and scalable in order to effectively extract information
from huge amounts of data in many data repositories or in dynamic data streams.
In other words, the running time of a data mining algorithm must be predictable,
short, and acceptable to applications. Efficiency, scalability, performance,
optimization, and the ability to execute in real time are key criteria that
drive the development of many new data mining algorithms.
Parallel, distributed, and incremental mining algorithms: The humongous size
of many data sets, the wide distribution of data, and the computational complexity
of some data mining methods are factors that motivate the development of parallel
and distributed data-intensive mining algorithms. Such algorithms first partition
the data into "pieces." Each piece is processed, in parallel, by
searching for patterns. The parallel processes may interact with one another.
The patterns from each partition are eventually merged.
Cloud computing and cluster computing, which use computers in a distributed
and collaborative way to tackle very large-scale computational tasks, are also
active research themes in parallel data mining. In addition, the high cost
of some data mining processes and the incremental nature of input promote incremental
data mining, which incorporates new data updates without having to mine the
entire data "from scratch." Such methods perform knowledge modification
incrementally to amend and strengthen what was previously discovered.
7.4 Diversity of Database Types
The wide diversity of database types brings about challenges to data mining.
These include
Handling complex types of data: Diverse applications generate a wide spectrum
of new data types, from structured data such as relational and data warehouse
data to semi-structured and unstructured data; from stable data repositories
to dynamic data streams; from simple data objects to temporal data, biological
sequences, sensor data, spatial data, hypertext data, multimedia data, software
program code, Web data, and social network data. It is unrealistic to expect
one data mining system to mine all kinds of data, given the diversity of data
types and the different goals of data mining.
Domain- or application-dedicated data mining systems are being constructed
for in-depth mining of specific kinds of data. The construction of effective
and efficient data mining tools for diverse applications remains a challenging
and active area of research.
Mining dynamic, networked, and global data repositories: Multiple sources
of data are connected by the Internet and various kinds of networks, forming
gigantic, distributed, and heterogeneous global information systems and networks.
The discovery of knowledge from different sources of structured, semi-structured,
or unstructured yet interconnected data with diverse data semantics poses great
challenges to data mining. Mining such gigantic, interconnected information
networks may help disclose many more patterns and knowledge in heterogeneous
data sets than can be discovered from a small set of isolated data repositories.
Web mining, multisource data mining, and information network mining have become
challenging and fast-evolving data mining fields.
7.5 Data Mining and Society
How does data mining impact society? What steps can data mining take to preserve
the privacy of individuals? Do we use data mining in our daily lives without
even knowing that we do? These questions raise the following issues:
Social impacts of data mining: With data mining penetrating our everyday lives,
it is important to study the impact of data mining on society. How can we use
data mining technology to benefit society? How can we guard against its misuse?
The improper disclosure or use of data and the potential violation of individual
privacy and data protection rights are areas of concern that need to be addressed.
Privacy-preserving data mining: Data mining will help scientific discovery,
business management, economic recovery, and security protection (e.g., the real-time
discovery of intruders and cyber attacks). However, it poses the risk of disclosing
an individual's personal information. Studies on privacy-preserving data publishing
and data mining are ongoing. The philosophy is to observe data sensitivity
and preserve people's privacy while performing successful data mining.
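One standard embodiment of this philosophy is to publish statistics that have been perturbed just enough to mask any single individual's contribution. The sketch below adds Laplace noise to item counts, the basic mechanism of differential privacy; the `epsilon` parameter, the data, and the function names are illustrative assumptions, not a production-ready library.

```python
import math
import random
from collections import Counter

def laplace_noise(rng, scale):
    """Draw one sample from a Laplace(0, scale) distribution."""
    u = rng.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_counts(records, epsilon, seed=None):
    """Release per-item counts with Laplace noise. One person changes any
    single count by at most 1 (sensitivity 1), so noise with scale
    1/epsilon yields epsilon-differential privacy for this query."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    return {item: count + laplace_noise(rng, scale)
            for item, count in Counter(records).items()}
```

A large `epsilon` means little noise (accurate but weakly private); a small `epsilon` means heavy noise. The analyst then mines the perturbed counts instead of the raw records.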
Invisible data mining: We cannot expect everyone in society to learn and master
data mining techniques. More and more systems should have data mining functions
built in so that people can perform data mining or use data mining results
simply by clicking a mouse, without any knowledge of data mining algorithms.
Intelligent search engines and Internet-based stores perform such invisible
data mining by incorporating data mining into their components to improve their
functionality and performance. This is often done unbeknownst to the user.
For example, when purchasing items online, users may be unaware that the store
is likely collecting data on the buying patterns of its customers, which may
be used to recommend other items for purchase in the future.
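The kind of mining such a store performs can be illustrated by a deliberately tiny co-occurrence recommender: items that frequently appear in the same baskets as a purchased item are suggested next. Real systems use far richer models; the basket data and function names here are assumptions made for the sketch.

```python
from collections import Counter, defaultdict

def build_cooccurrence(baskets):
    """Count how often each pair of distinct items is bought together."""
    co = defaultdict(Counter)
    for basket in baskets:
        items = set(basket)
        for a in items:
            for b in items:
                if a != b:
                    co[a][b] += 1
    return co

def recommend(co, purchased_item, k=3):
    """Suggest the k items most often co-purchased with the given item."""
    return [item for item, _ in co[purchased_item].most_common(k)]
```

From the user's point of view nothing happened but a purchase; the counting and the ranking behind the recommendation are the invisible mining.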
These issues and many additional ones relating to the research, development,
and application of data mining are discussed throughout the guide.
8. Summary
Necessity is the mother of invention. With the mounting growth of data in
every application, data mining meets the imminent need for effective, scalable,
and flexible data analysis in our society. Data mining can be considered as
a natural evolution of information technology and a confluence of several related
disciplines and application domains.
Data mining is the process of discovering interesting patterns from massive
amounts of data. As a knowledge discovery process, it typically involves data
cleaning, data integration, data selection, data transformation, pattern discovery,
pattern evaluation, and knowledge presentation.
A pattern is interesting if it is valid on test data with some degree of certainty,
novel, potentially useful (e.g., can be acted on or validates a hunch about
which the user was curious), and easily understood by humans. Interesting patterns
represent knowledge. Measures of pattern interestingness, either objective
or subjective, can be used to guide the discovery process.
We present a multidimensional view of data mining. The major dimensions are
data, knowledge, technologies, and applications.
Data mining can be conducted on any kind of data as long as the data are meaningful
for a target application, such as database data, data warehouse data, transactional
data, and advanced data types. Advanced data types include time-related or
sequence data, data streams, spatial and spatiotemporal data, text and multimedia
data, graph and networked data, and Web data.
A data warehouse is a repository for long-term storage of data from multiple
sources, organized so as to facilitate management decision making. The data
are stored under a unified schema and are typically summarized. Data warehouse
systems provide multidimensional data analysis capabilities, collectively
referred to as online analytical processing (OLAP).
Multidimensional data mining (also called exploratory multidimensional data
mining) integrates core data mining techniques with OLAP-based multidimensional
analysis. It searches for interesting patterns among multiple combinations
of dimensions (attributes) at varying levels of abstraction, thereby exploring
multidimensional data space.
Data mining functionalities are used to specify the kinds of patterns or knowledge
to be found in data mining tasks. The functionalities include characterization
and discrimination; the mining of frequent patterns, associations, and correlations;
classification and regression; cluster analysis; and outlier detection. As
new types of data, new applications, and new analysis demands continue to emerge,
there is no doubt we will see more and more novel data mining tasks in the
future.
Data mining, as a highly application-driven domain, has incorporated technologies
from many other domains. These include statistics, machine learning, database
and data warehouse systems, and information retrieval. The interdisciplinary
nature of data mining research and development contributes significantly to
the success of data mining and its extensive applications.
Data mining has many successful applications, such as business intelligence,
Web search, bioinformatics, health informatics, finance, digital libraries,
and digital governments.
There are many challenging issues in data mining research. Areas include mining
methodology, user interaction, efficiency and scalability, and dealing with
diverse data types. Data mining research has strongly impacted society and
will continue to do so in the future.
9. Exercises
1 What is data mining? In your answer, address the following:
(a) Is it simply hype?
(b) Is it a simple transformation or application of technology developed from
databases, statistics, machine learning, and pattern recognition?
(c) We have presented a view that data mining is the result of the evolution
of database technology. Do you think that data mining is also the result of
the evolution of machine learning research? Can you present such views based
on the historical progress of this discipline? Address the same for the fields
of statistics and pattern recognition.
(d) Describe the steps involved in data mining when viewed as a process of
knowledge discovery.
2 How is a data warehouse different from a database? How are they similar?
3 Define each of the following data mining functionalities: characterization,
discrimination, association and correlation analysis, classification, regression,
clustering, and outlier analysis. Give examples of each data mining functionality,
using a real-life database that you are familiar with.
4 Present an example where data mining is crucial to the success of a business.
What data mining functionalities does this business need (e.g., think of the
kinds of patterns that could be mined)? Can such patterns be generated alternatively
by data query processing or simple statistical analysis?
5 Explain the difference and similarity between discrimination and classification,
between characterization and clustering, and between classification and regression.
6 Based on your observations, describe another possible kind of knowledge
that needs to be discovered by data mining methods but has not been listed
in this section. Does it require a mining methodology that is quite different
from those outlined in this section?
7 Outliers are often discarded as noise. However, one person's garbage could
be another's treasure. For example, exceptions in credit card transactions
can help us detect the fraudulent use of credit cards. Using fraud detection
as an example, propose two methods that can be used to detect outliers and
discuss which one is more reliable.
8 Describe three challenges to data mining regarding data mining methodology
and user interaction issues.
9 What are the major challenges of mining a huge amount of data (e.g., billions
of tuples) in comparison with mining a small amount of data (e.g., a data set
of a few hundred tuples)?
10 Outline the major research challenges of data mining in one specific
application domain, such as stream/sensor data analysis, spatiotemporal data
analysis, or bioinformatics.
References
The book Knowledge Discovery in Databases, edited by Piatetsky-Shapiro and
Frawley [P-SF91], is an early collection of research papers on knowledge discovery
from data.
The book Advances in Knowledge Discovery and Data Mining, edited by Fayyad,
Piatetsky-Shapiro, Smyth, and Uthurusamy [FPSS+96], is a collection of later
research results on knowledge discovery and data mining. There have been many
data mining books published in recent years, including The Elements of Statistical
Learning by Hastie, Tibshirani, and Friedman [HTF09]; Introduction to Data
Mining by Tan, Steinbach, and Kumar [TSK05]; Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations by Witten, Frank, and
Hall [WFH11]; Predictive Data Mining by Weiss and Indurkhya [WI98]; Mastering
Data Mining: The Art and Science of Customer Relationship Management by Berry
and Linoff [BL99]; Principles of Data Mining (Adaptive Computation and Machine
Learning) by Hand, Mannila, and Smyth [HMS01]; Mining the Web: Discovering Knowledge
from Hypertext Data by Chakrabarti [Cha03a]; Web Data Mining: Exploring Hyperlinks,
Contents, and Usage Data by Liu [Liu06]; Data Mining: Introductory and Advanced
Topics by Dunham [Dun03]; and Data Mining: Multimedia, Soft Computing, and
Bioinformatics by Mitra and Acharya [MA03].
There are also books that contain collections of papers or sections on particular
aspects of knowledge discovery; for example, Relational Data Mining edited by
Dzeroski and Lavrac [De01]; Mining Graph Data edited by Cook and Holder [CH07];
Data Streams: Models and Algorithms edited by Aggarwal [Agg06]; Next Generation
of Data Mining edited by Kargupta, Han, Yu, et al. [KHYC08]; Multimedia Data
Mining: A Systematic Introduction to Concepts and Theory edited by Z. Zhang
and R. Zhang [ZZ09];
Geographic Data Mining and Knowledge Discovery edited by Miller and Han [MH09];
and Link Mining: Models, Algorithms and Applications edited by Yu, Han, and
Faloutsos [YHF10]. There are many tutorial notes on data mining in major databases,
data mining, machine learning, statistics, and Web technology conferences.
KDNuggets is a regular electronic newsletter containing information relevant
to knowledge discovery and data mining, moderated by Piatetsky-Shapiro since
1991.
The KDNuggets website (www.kdnuggets.com) contains a good collection of
KDD-related information.
The data mining community started its first international conference on knowledge
discovery and data mining in 1995. The conference evolved from the four international
workshops on knowledge discovery in databases, held from 1989 to 1994.
ACM-SIGKDD, a Special Interest Group on Knowledge Discovery in Databases, was
set up under ACM in 1998 and has been organizing the international conferences
on knowledge discovery and data mining since 1999. The IEEE Computer Society
has organized its annual data mining conference, the International Conference on
Data Mining (ICDM), since 2001. SIAM (the Society for Industrial and Applied
Mathematics) has organized its annual data mining conference, the SIAM Data Mining Conference
(SDM), since 2002. A dedicated journal, Data Mining and Knowledge Discovery,
published by Kluwer Academic Publishers, has been available since 1997. An ACM journal,
ACM Transactions on Knowledge Discovery from Data, published its first volume
in 2007.
ACM-SIGKDD also publishes a bi-annual newsletter, SIGKDD Explorations. There
are a few other international or regional conferences on data mining, such
as the European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases (ECML PKDD), the Pacific-Asia Conference
on Knowledge Discovery and Data Mining (PAKDD), and the International Conference
on Data Warehousing and Knowledge Discovery (DaWaK).
Research in data mining has also been published in books, conferences, and
journals on databases, statistics, machine learning, and data visualization.
References to such sources are listed at the end of the book.
Popular textbooks on database systems include: Database Systems: The Complete
Book by Garcia-Molina, Ullman, and Widom [GMUW08]; Database Management Systems
by Ramakrishnan and Gehrke [RG03]; Database System Concepts by Silberschatz,
Korth, and Sudarshan [SKS10]; and Fundamentals of Database Systems by Elmasri
and Navathe [EN10]. For an edited collection of seminal articles on database
systems, see Readings in Database Systems by Hellerstein and Stonebraker [HS05].
There are also many books on data warehouse technology, systems, and applications,
such as: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling
by Kimball and Ross [KR02]; The Data Warehouse Lifecycle Toolkit by Kimball,
Ross, Thornthwaite, and Mundy [KRTM08]; Mastering Data Warehouse Design: Relational
and Dimensional Techniques by Imhoff, Galemmo, and Geiger [IGG03]; and Building
the Data Warehouse by Inmon [Inm96]. A set of research papers on materialized
views and data warehouse implementations were collected in Materialized Views:
Techniques, Implementations, and Applications by Gupta and Mumick [GM99]. Chaudhuri
and Dayal [CD97] present an early comprehensive overview of data warehouse
technology.
Research results relating to data mining and data warehousing have been published
in the proceedings of many international database conferences, including the
ACM-SIGMOD International Conference on Management of Data (SIGMOD), the International
Conference on Very Large Data Bases (VLDB), the ACM SIGACT-SIGMOD-SIGART Symposium
on Principles of Database Systems (PODS), the International Conference on
Data Engineering (ICDE), the International Conference on Extending Database
Technology (EDBT), the International Conference on Database Theory (ICDT),
the International Conference on Information and Knowledge Management (CIKM),
the International Conference on Database and Expert Systems Applications (DEXA),
and the International Symposium on Database Systems for Advanced Applications
(DASFAA). Research in data mining is also published in major database journals,
such as IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions
on Database Systems (TODS), Information Systems, The VLDB Journal, Data and
Knowledge Engineering, International Journal of Intelligent Information Systems
(JIIS), and Knowledge and Information Systems (KAIS).
Many effective data mining methods have been developed by statisticians and
introduced in a rich set of textbooks. An overview of classification from a
statistical pattern recognition perspective can be found in Pattern Classification
by Duda, Hart, and Stork [DHS01]. There are also many textbooks covering regression
and other topics in statistical analysis, such as Mathematical Statistics:
Basic Ideas and Selected Topics by Bickel and Doksum [BD01]; The Statistical
Sleuth: A Course in Methods of Data Analysis by Ramsey and Schafer [RS01];
Applied Linear Statistical Models by Neter, Kutner, Nachtsheim, and Wasserman
[NKNW96]; An Introduction to Generalized Linear Models by Dobson [Dob90]; Applied
Statistical Time Series Analysis by Shumway [Shu88]; and Applied Multivariate
Statistical Analysis by Johnson and Wichern [JW92].
Research in statistics is published in the proceedings of several major statistical
conferences, including the Joint Statistical Meetings, the International Conference
of the Royal Statistical Society, and the Symposium on the Interface: Computing
Science and Statistics.
Other sources of publication include the Journal of the Royal Statistical
Society, The Annals of Statistics, the Journal of the American Statistical
Association, Technometrics, and Biometrika.
Textbooks and reference books on machine learning and pattern recognition
include: Machine Learning by Mitchell [Mit97]; Pattern Recognition and Machine
Learning by Bishop [Bis06]; Pattern Recognition by Theodoridis and Koutroumbas
[TK08]; Introduction to Machine Learning by Alpaydin [Alp11]; Probabilistic
Graphical Models: Principles and Techniques by Koller and Friedman [KF09];
and Machine Learning: An Algorithmic Perspective by Marsland [Mar09]. For
an edited collection of seminal articles on machine learning, see Machine Learning,
An Artificial Intelligence Approach, Volumes 1 through 4, edited by Michalski
et al. [MCM83, MCM86, KM90, MT94], and Readings in Machine Learning by Shavlik
and Dietterich [SD90].
Machine learning and pattern recognition research is published in the proceedings
of several major machine learning, artificial intelligence, and pattern recognition
conferences, including the International Conference on Machine Learning (ML),
the ACM Conference on Computational Learning Theory (COLT), the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), the International Conference
on Pattern Recognition (ICPR), the International Joint Conference on Artificial
Intelligence (IJCAI), and the American Association for Artificial Intelligence
Conference (AAAI). Other sources of publication include major machine learning,
artificial intelligence, pattern recognition, and knowledge system journals,
some of which have been mentioned before. Others include Machine Learning (ML),
Pattern Recognition (PR), Artificial Intelligence Journal (AI), IEEE Transactions
on Pattern Analysis and Machine Intelligence (PAMI), and Cognitive Science.
Textbooks and reference books on information retrieval include Introduction
to Information Retrieval by Manning, Raghavan, and Schütze [MRS08]; Information
Retrieval: Implementing and Evaluating Search Engines by Büttcher, Clarke,
and Cormack [BCC10]; Search Engines: Information Retrieval in Practice by Croft,
Metzler, and Strohman [CMS09]; Modern Information Retrieval: The Concepts and
Technology Behind Search by Baeza-Yates and Ribeiro-Neto [BYRN11]; and Information
Retrieval: Algorithms and Heuristics by Grossman and Frieder [GR04].
Information retrieval research is published in the proceedings of several
information retrieval and Web search and mining conferences, including the
International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR), the International World Wide Web Conference (WWW), the ACM
International Conference on Web Search and Data Mining (WSDM), the ACM Conference
on Information and Knowledge Management (CIKM), the European Conference on
Information Retrieval (ECIR), the Text Retrieval Conference (TREC), and the
ACM/IEEE Joint Conference on Digital Libraries (JCDL). Other sources of publication
include major information retrieval, information systems, and Web journals,
such as Journal of Information Retrieval, ACM Transactions on Information Systems
(TOIS), Information Processing and Management, Knowledge and Information Systems
(KAIS), and IEEE Transactions on Knowledge and Data Engineering (TKDE).