Database Management Systems II - Prof. Holowczak
Zicklin School of Business - Baruch College
City University of New York
Database Management Systems II
Data Mining
What You'll Learn This Week
Chapter 26 in Elmasri/Navathe (3rd edition)
Data Mining
Some materials from:
"Data Mining Techniques for Marketing, Sales and
Customer Support" by Michael Berry and Gordon Linoff.
Wiley Computer Publications, 1997.
"The KDD Process for extracting useful knowledge from
volumes of data" by U. Fayyad, G. Pietetski-Shapiro
and P. Smyth. Communications of the ACM, Volume 39,
Number 11. November, 1996
Actually, this entire CACM issue is an excellent
introduction to the process and the research issues
associated with data mining.
- Data Mining - Data mining is the
analysis of data (any data) for relationships
that have not previously been discovered.
- Database Mining - Data mining in
the context of a database management system.
- Database Marketing - The transformation of
databases into business decisions. Data mining
is simply a tool - Database Marketing is a
business application of data mining.
- Recall in a data warehouse, we have abstracted
and summarized data. So we already know the
relationships among data used for summaries,
etc.
- Data Mining attempts to discover implicit
relationships or patterns in the data that
were previously not known.
- Another definition: Data Mining is the
non-trivial process of identifying
valid, novel, potentially useful and
ultimately understandable patterns in data
[Fayyad, Pietetski-Shapiro, Smyth, CACM, 1996].
- Ideally, the knowledge gained should be
actionable - that is, from the
analysis of the DM results, we should be
able to take some action. For example,
which of the following are actionable ?
- A customer who has switched long distance
carriers in the past is likely to do so
again.
- A customer who has switched long distance
carriers 6 months ago is likely to
switch again within the next month.
- Customers who purchase pants, a sport coat
and two shirts are likely to also purchase
men's cologne.
- If a customer purchased $100 worth of
goods from us before, they are likely to
purchase $100 worth of goods again.
- If a customer calls technical support
more than 3 times during the first
2 weeks of owning a computer, they
are likely to return it.
- Some goals of data mining:
- Find associations between data
items or between attributes (link analysis).
- Classify data into categories.
- Cluster data according to their
associations.
- Forecast future events based on
existing data.
- Summarize by providing a
compact description of a subset of
data.
- Detect sequences or
patterns in data.
- Tradeoff between accuracy and
coverage.
- Accuracy indicates the percentage of time
the discovered associations,
classifications, patterns or summarizations
reflect the underlying data.
- Coverage indicates for which portion of the
data the discovered associations, etc.
apply to.
- For example, we might discover a pattern
that couples who live and work in Manhattan
and own 2 cars make more than $50,000 a year.
This can only be said for folks in Manhattan (limited
coverage) - e.g., it says nothing about people
in St. Louis.
This will not hold true for all
who live/work in Manhattan (limited accuracy).
- Data Mining is the combination of three areas
of research: Databases, Statistics and
Artificial Intelligence. Their roles are:
- Databases: Data must be prepared and
cleaned for mining.
- AI: Algorithms to detect patterns,
classify and cluster come from the AI
research community. For example:
Neural networks, clustering
algorithms.
- Statistics: Mathematics behind
association and pattern analysis. e.g.,
regression or time series analysis.
Data Mining has a negative connotation in
the statistics world - perhaps an abuse of
statistics to find statistically significant
patterns in data that are not relevant
or valid.
- Is data Mining a new thing ?
Yes and No.
No, because MIS has always been able to run
some queries that gather statistics about
data. Also, research community has been
running "DM-like" experiments for decades
Yes, in the form of database marketing because:
- Data is being collected at unprecedented
rates
- Data is being stored for extended period
of time (see data warehouse)
- Computing power and storage have
increased dramatically
- Companies are looking for a competitive
edge.
- Some example applications of data mining
(Summarized from Brachman, et. al.
"Mining Business Databases", CACM, 1996):
- Marketing: Use predictive modeling to
determine which customers are likely
to buy a given product or service, or
which would respond to a given
promotion.
Market basket analysis: Use point
of sale data to support decisions on
product shelf allocation, store
layout or promotion effectiveness.
- Finance: Using predictive models
to guide investment strategy.
Fraud detection: Detect abnormal trading
behavior, fraudulent use of credit cards
or calling cards.
- Telecommunications: Churn analysis.
Identify those customers who are likely
to change long distance or cellular
carriers.
- An outline of the Data Mining Process:
- Problem description / determination.
This includes identifying the business
need/problem or opportunity.
- Identification of source data
- Preparation of source data (cleaning,
integrating)
- Data Mining algorithm selection.
Choosing the appropriate algorithm
or model for the data mining task.
- Perform the data mining.
- Interpret the results.
- Act on the interpretation.
- Two basic approaches to data mining:
- Hypothesis testing (Verification driven)
- Knowledge Discovery (discovery driven)
Techniques for Hypothesis Testing
(Verification-Driven)
- Hypothesis testing is considered a "Top
Down" approach:
- State a hypothesis about your data, your
business, etc.
- Identify and prepare data that can be
used to test the hypothesis
- Build a model (set of inputs, outputs and
operations) based on the data to test the
hypothesis.
- Evaluate the model to see if it confirms
the hypothesis (e.g., is the hypothesis
verified ?).
- Hypothesis testing is useful in cases where
management suspects something about the
business and can act if the suspicion holds
true.
The data and models either support or
do not support this suspicion.
- Example:
- Hypothesis: Customers that purchase
products in our store and redeem coupons
will tend to
have lower incomes than customers who
purchase products and do not redeem
coupons.
- Data Sources:
Each coupon is tagged with a unique
number.
Mailing list for customers across the
country who received coupons.
Census data showing median incomes per
zip code.
Sales data indicating the store, product
purchased and coupon number redeemed (if
any).
- Model: A linear model predicting the use
of a coupon given the median income (from
zip the code).
- Evaluation: Run the data through a
statistical package such as SAS or SPSS
and examine the output. e.g., goodness of
fit, sum of errors, etc.
- Big problem is missing data and
accuracy of data.
- Advantages: Lots of experience with
hypothesis testing - well known techniques.
Lots of statistical tools to back it up.
Answers in mathematical terms.
- Disadvantages: Must come up with
a good hypothesis - it might be difficult to
come to consensus as to exactly what the
problem is, etc.
Missing and erroneous data
can skew the results.
Now that you have the results, what action
can be taken (this is an implicit part of
forming the hypothesis) ?
Techniques for Knowledge Discovery
(Discovery-Driven)
- Knowledge discovery is a "Bottom-up" approach.
- Two main approaches:
- Directed: Given a target data
item, discover a way to estimate,
classify or predict its value given other
data items.
- Undirected: Given a set of data
items, discover rules to classify,
estimate or predict their values or
associations.
- We might use a hybrid of the two: First use
undirected to discover some
relationships. Then use directed to
build a model to predict future instances.
- Directed discovery works as follows:
- Identify data sources. Data should be
"pre-classified".
- Prepare existing data (cleaning,
normalizing, etc.) and/or derive new data
that is more meaningful to the business.
e.g., instead of keeping both gallons of
gasoline purchased and mileage, convert
to miles per gallon and just store
that.
- Divide data into training, testing and
evaluation sets.
Training data is used to build (train)
the models.
The testing set is used to gain
experience with the model and allow
changes.
The evaluation set is then used to gauge
how effective the model is.
- Build and train the model. A model is
posed in terms of a set of input
variables and output variables.
Some techniques: Neural networks and
decision trees.
- Test the model with the testing set of
data. This step serves to "broaden" the
model and make it more applicable to the
data in general while preserving the
accuracy of the model. We say such models
are more robust.
- The final step is to evaluate the model
using the evaluation data set.
- Undirected discovery follows the same
basic set of steps, however, further
interpretation of the results must be done.
Market Basket Analysis
- Market Basket Analysis is a data mining
technique that examines which items (or
services) are likely to be purchased together.
- In a retail environment, customers may not be
uniquely identified. Consider: Grocery store,
7-11, etc.
- The only information we have is what products
were purchased together at a given time for a
large number of customers.
- Analysis of this information can lead us to:
- Suggest how to lay out the items in a
store.
- Suggest which products to promote with
coupons or sales/specials.
- Give insight into brand loyalty,
co-branding and brand name recognition.
- Looking at the dimensions of such data:
- Time - the day and time the goods were purchased
- Products - The basked of products purchased
together
- Location - The store location where the products
were purchased
- Demographics associated with the store location
- Using undirected or directed knowledge discovery,
we can discover association rules
for products. For example:
- If tortilla chips are purchased on a
Friday or Saturday, then it is likely that
salsa will also be purchased.
- Customers who sign up for voice mail service
are likely to also sign up for call waiting.
- Customers who purchase a shovel or a garden hoe
in May are likely to also purchase gardening
gloves.
- Example:
| Customer | Purchases
|
|---|
| 1 | Top soil, gloves
|
| 2 | Pruning shears, Top soil, flower seeds
|
| 3 | Top soil, weed killer
|
| 4 | Top soil, weed killer, gloves
|
| 5 | Flower seeds, gloves
|
- Form a co-occurrence matrix:
| Top Soil | Seeds | Shears | Gloves | Weed Killer
|
| Top Soil | 4 | 1 | 1 | 2 | 2
|
| Seeds | 1 | 2 | 1 | 1 | 0
|
| Shears | 1 | 1 | 1 | 0 | 0
|
| Gloves | 2 | 1 | 0 | 3 | 1
|
| Weed killer | 2 | 0 | 0 | 1 | 2
|
- Some rules to note:
- Pruning shears and gloves are never
purchased together
- Weed killer and Flower seeds are never
purchased together
- Gloves and Top soil are most likely to be
purchased together
- We can write these are more formal association rules
in the form of IF condition THEN result
- We can consider the support, confidence
and improvement for such rules:
- Support: The percentage of records containing
the itemset.
- Confidence: A measure of support given the
condition is true
- Improvement: The confidence given that
the condition and the result. e.g.,
Confidence / (Support for Condition * Support for Result)
- Example:
If a customer purchases Top Soil then they are
likely to also purchase Gloves
- Top Soil is purchased 80% of the time (4 out of 5 transactions)
Gloves are purchased 60% of the time (3 out of 5 transactions)
Top Soil and Gloves are purchased together 40% of the time (2 out of 5 transactions)
- Support: 40%
This occurs in 2 out of 5 transactions (2/5)
- Confidence: 40% / 80% = 50%
Top Soil is purchased by itself in 80% of the transactions (4/5)
while the combination of Top Soil and Gloves only
appears in 40% of the cases.
- Improvement: 83%
40% / (80% * 60%)
- So, what marketing/product placement strategies can
you come up with ?
- In practice, any rule with improvement less than 100%
is likely not a good candidate for cross promotion.
- Other examples: Options ordered with PCs or cars,
Telephone Services (call waiting, forwarding, etc.), etc.
- Market Basket Analysis is relatively
straight forward to implement and understand.
Also supports undirected data mining.
- However, as data size grows large, it becomes
computationally intensive. Outliers or
"rare events" are obscured. Also, the types
of problems that can be addressed are limited.
- Computer science research looks into more
efficient ways of computing large itemsets in
more than 2 dimensions.
-
Decision Trees
- Decision trees are used for classification
prediction.
- A tree is a set of nodes and branches. Each
node has at most one parent node connected by
a branch.
Each node can have none, one or more branches
connecting to child nodes.
A node with no children is a leaf node.
The node with no parent is the root node.
- Typically, each node represents a variable
(attribute) in our data and the outward links
(to child nodes) represent possible values for
the variable.
- Assume we have the following denormalized
table Customer Purchase:
| Name | City | State | Zip | Phone | Date of Last Purchase | Purchase Amount | Coupon Redeemed | Avg. Salary in Zip
|
|---|
| Smith | Bound Brook | NJ | 08102 | 732-555-0000 | 10-JAN-1998 | $120 | No | $40,000
|
| Green | Edison | NJ | 08211 | 732-555-0001 | 13-JAN-1998 | $80 | Yes | $45,000
|
| Jones | NYC | NY | 10010 | 212-555-0000 | 27-JAN-1998 | $320 | Yes | $65,000
|
| Red | NYC | NY | 10011 | 212-555-1000 | 28-JAN-1998 | $120 | No | $95,000
|
| Brown | Short Hills | NJ | 08555 | 201-555-1111 | 22-JAN-1998 | $400 | No | $86,000
|
- Can you find some decision rules that would
allow you to predict future use of a coupon
depending on one or more other attributes ?

- Interpret rules as follows: Traversing a
branch adds an "AND" to the rule. e.g.,
IF State=NJ AND
DateLastPurchase=EndOfMonth AND
AverageSalary < $50k THEN Coupon
- This is a simple tree for just a small amount
of data. The key is to be able to generate the
tree and the rules automatically from a
collection of training data.
- Use a decision tree algorithm such as:
- CART - Classification and Regression Trees
- CHAID - Chi-Squared Automatic Interaction
Detection
- ID3 - Iterative Dichotomiser 3 (J. Ross
Quinlan)
- C4.5 - Latest decision tree algorithm by
J. Ross Quinlan
- Such algorithms create the decision tree
automatically by creating simple rules
subdivide the data.
- For example, Consider:
Age, ZipCode, Gender, HighestDegree
First node splits on Gender, next splits on
HighestDegree, next on range of Age, etc.
- Decision trees are useful because they
generally produce a set of understandable
rules and clearly indicate which data items
are most important for classification. They
can handle both continuous (using ranges) and
discrete (categorical) data types and perform
the classification task without requiring a
large amount of computation.
- Decision trees are not appropriate for
predicting continuous values or for time
series data. Decision trees also require
training which may be computationally
intensive.
Link Analysis
- Link analysis is used to find links (causal
relationships) between data points.
- Based on Graph Theory (nodes and weighted
arcs)
- Example: Which telephone numbers are used for
Voice, Fax and Internet ?
- Other examples:
Package Delivery (Source address, Destination
address, package weight, cost, date picked up,
date delivered, etc.)
Physician Referrals (Source physician,
Referral Physician, Diagnosis, Cost, etc.)
Airline flights, etc.
- Link Analysis is good for visualizing
relationships between data instances and items
and can help to generate new characteristics
not evident in the original data.
- However, many types of data do not lend
themselves to link analysis and few tools
support it.
Other Techniques
Many more - could fill an entire (maybe 2) courses:
- Neural Networks - a set of input, outputs and
"neurons" in the middle that "learn" patterns
of data based on training and can then "fire"
outputs to classify new data.
- Genetic Algorithms
- Memory Based Reasoning
- Cluster Detection
File: dm_index.html Date: Thu Mar 9 23:20:56 EST 2000
All materials Copyright, 1997-2000 Richard Holowczak