Mining For Gold

A raft of data mining tools offers a wide range of features for digging up business opportunities. Here's how to find the best product for you.

By Herb Edelstein
Information Week: April 21, 1997
Copyright 1997 CMP Media, Inc.

What is the profitability of your customers? Which products are normally sold together? Which customers are likely to jump ship? These are common business questions, but the answers aren't easy to find unless you mine your customer data. Data mining tools use statistical and machine-learning methods to search databases for patterns that describe relationships in the data or predict future values or behavior.

At least 50 data mining products have been released, and new entries are arriving at a rapid rate. But how do you differentiate among these products with their sometimes unfamiliar technologies? You can't categorize data mining tools into simple labels such as high-end or low-end because the products are too rich in functionality to be divided along just one dimension. Here are some basic evaluation criteria you can apply to help select the right data mining tool. Also, here's a look at the features of 11 products that represent the spectrum of data mining products available.

Data mining tools model the database to find relationships in the data. First, the business problem is identified. For example, you might want to find patterns to help you retain good customers. So you further dissect your problem into two questions: What is the profitability of customers, and which customers are likely to leave? (This second occurrence is called attrition.)

Next, you identify the types of models you need to answer your questions. Classification, regression, and time-series models are primarily useful for generating predictions, while cluster, association, and sequence-discovery models are primarily useful for describing behavior that's captured in your database. You might build a regression model to forecast profitability and a classification model to predict attrition. There is often some confusion in terminology here; rather than referring to these as types of models, people sometimes refer to them as types of problems. But it's preferable to use the term "problem" for referring to the business problem.

Next, choose the technology to build the model. In this case, you might use a neural net to perform the regression and a decision tree to do the classification.

Finally, the technology used to construct the model is implemented through a product-specific algorithm. Each product will have different implementations of a particular technology.

The first thing to evaluate about a data mining product, then, is what model types it builds. Classification models solve business problems such as identifying who is most likely to respond to a direct-mail solicitation. The tools create classification models by examining already-classified data (cases) and inductively finding the predictive pattern.

The cases may come from a historical database, such as customers who have moved to a new long-distance carrier. Alternatively, the cases may come from an experiment in which a sample of the entire database is tested in the real world, and the results are used to place cases in classes.

Regression models use series of existing values to forecast what other values will be. Because quantities such as sales volumes, stock prices, and failure rates are difficult to predict (because of nonlinear relationships as well as their dependence on complex interactions of multiple variables), technologies such as neural nets are often used to build regression models.

Like regressions, time-series forecasting uses series of existing values. But time-series forecasting also takes into account the distinctive properties of time (such as holidays, date arithmetic, and special considerations such as rolling averages) to forecast future values based on past values.
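A rolling average is the simplest of the time-aware calculations mentioned above. The following Python sketch is only an illustration: the sales figures are invented, and the function name and the three-period window are assumptions, not anything taken from the products reviewed here.

```python
from collections import deque

def rolling_average(values, window):
    """Yield the mean of the most recent `window` values once enough have been seen."""
    recent = deque(maxlen=window)  # the oldest value drops out automatically
    for value in values:
        recent.append(value)
        if len(recent) == window:
            yield sum(recent) / window

# Invented weekly sales figures, for illustration only.
weekly_sales = [10, 12, 11, 15, 14, 18]
smoothed = list(rolling_average(weekly_sales, window=3))

# A deliberately naive forecast: assume next week looks like the latest average.
naive_forecast = smoothed[-1]
print(smoothed, naive_forecast)
```

A real time-series tool would go well beyond this, for instance by adjusting for holidays and seasonality, but the smoothing step itself looks much like the loop above.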

Clustering models segment a database into different groups whose members are very similar. Unlike classification, you don't know what the clusters will be when you start, or on what attributes the data will be clustered. Consequently, a user who is knowledgeable in the business needs to interpret the clusters.
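To make the idea of segmenting concrete, here is a minimal Python sketch of k-means, a clustering technique that some of the products described later use. It works on a single made-up column (monthly spending), and the starting centers, function name, and data are all assumptions chosen for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers around `centers`, refining the centers on each pass."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put every value in the group of its nearest center.
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Invented monthly spending figures, for illustration only.
monthly_spend = [12, 15, 14, 80, 85, 90, 13, 88]
centers, groups = k_means_1d(monthly_spend, centers=[0, 100])
print(centers, groups)
```

The algorithm finds two groups here, but it is still up to someone who knows the business to decide whether "low spenders" and "high spenders" is a meaningful way to read them.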

Association models find items that occur together in a given event or record. Association tools discover rules of the form: If item A is part of an event, then X% of the time (the confidence factor), item B is part of the event. For example, an association model might discover that if a customer purchases corn chips, then 65% of the time that customer also purchases cola, unless there's a promotion, in which case the customer buys cola 85% of the time.

Interest in association analysis has increased with the widespread use of checkout scanners, which let retailers gather transaction detail. That's why association analysis is sometimes called market-basket analysis.
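The confidence factor in a rule of the form "if A, then B" is simply the share of events containing A that also contain B. The Python sketch below computes it for a handful of invented market baskets; the function name and data are assumptions for illustration, and a real association tool would search over all item combinations rather than test one rule.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Return (events containing the antecedent, confidence of antecedent -> consequent)."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0, 0.0
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_antecedent), len(with_both) / len(with_antecedent)

# Invented market baskets, for illustration only.
baskets = [
    {"corn chips", "cola"},
    {"corn chips", "cola", "salsa"},
    {"corn chips", "salsa"},
    {"cola", "bread"},
    {"corn chips", "cola"},
]
count, confidence = rule_confidence(baskets, "corn chips", "cola")
print(f"{count} baskets contain corn chips; confidence = {confidence:.0%}")
```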

Sequence discovery is similar to association analysis, except that the relationships among items are spread over time. In fact, most data mining products treat sequences simply as associations in which the events are linked by time.

In order to find these sequences, you must have captured not only the details of each transaction but also the identity of the transactors. For example, a sequence model might discover that if surgical procedure X is performed, then 45% of the time infection Y occurs within five days.

Sequence discovery can also take advantage of the elapsed time between transactions that make up the event. For example, it could find that after five days, the likelihood of infection Y drops to 4%.
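The Python sketch below shows why the identity of the transactor matters: events are first grouped by subject, and only then is the elapsed time between them checked. The patient IDs, dates, and function name are invented for illustration and do not reflect how any particular product implements sequence discovery.

```python
from datetime import date

def follow_rate(events, first, second, max_days):
    """Fraction of `first` events followed by `second` for the same subject within max_days."""
    by_subject = {}
    for subject, name, day in events:
        by_subject.setdefault(subject, []).append((name, day))
    firsts = hits = 0
    for history in by_subject.values():
        for name, day in history:
            if name != first:
                continue
            firsts += 1
            # Count a hit if the second event occurs 0..max_days days later.
            if any(n == second and 0 <= (d - day).days <= max_days for n, d in history):
                hits += 1
    return hits / firsts if firsts else 0.0

# Invented records: (patient, event, date).
events = [
    ("p1", "procedure X", date(1997, 3, 1)),
    ("p1", "infection Y", date(1997, 3, 4)),
    ("p2", "procedure X", date(1997, 3, 2)),
    ("p3", "procedure X", date(1997, 3, 5)),
    ("p3", "infection Y", date(1997, 3, 20)),
    ("p4", "procedure X", date(1997, 3, 7)),
    ("p4", "infection Y", date(1997, 3, 9)),
]
rate = follow_rate(events, "procedure X", "infection Y", max_days=5)
print(f"Infection Y followed procedure X within 5 days in {rate:.0%} of cases")
```

Changing `max_days` shows how the same data supports different rules depending on the time window considered.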

Problem Complexity
The complexity of the problem will determine how difficult it is to extract meaningful relationships from the data. Problems increase in complexity as the amount of data increases. Other contributors to problem complexity are the level of interaction among variables being examined and nonlinearity in the variables and parameters.

Furthermore, as the patterns become more subtle and the need for accuracy rises, finding the right patterns becomes more difficult.

Products deal with complexity by providing a diversity of model types and algorithms, tools for selecting and transforming the data, tools for validating the results, and scalable architectures.

Using combinations of models can often help find useful patterns in data. For example, using clustering to segment a data set and then developing predictive models for the segments may produce more accurate predictions than building a single model on the whole data set.

Many problems, especially those involving classification, will reveal their patterns differently to different algorithms. This could be caused by the type of data or the nature of the patterns themselves. A product that provides alternative algorithms for building a model is more likely to be able to handle complications.

A number of validation methods can be used when training (or estimating) a model. Sophisticated statistical methods such as n-fold cross validation or bootstrapping can help maximize accuracy while controlling overfitting, in which the model works well only on the data used to train it. However, those methods are processing-intensive, so it will take some time to validate the model.
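In n-fold cross validation, the data is divided into n parts; the model is trained n times, each time holding out a different part for testing. The Python sketch below illustrates the bookkeeping with a deliberately trivial "model" that always predicts the most common label. All names and data are assumptions for illustration, and a real run would normally shuffle the cases before splitting.

```python
def n_fold_splits(n_cases, n_folds):
    """Yield (train_indices, test_indices); each case is held out exactly once."""
    indices = list(range(n_cases))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def cross_validated_error(cases, labels, n_folds, fit, predict):
    """Average error rate over cases the model did not see during training."""
    errors = 0
    for train, test in n_fold_splits(len(cases), n_folds):
        model = fit([cases[i] for i in train], [labels[i] for i in train])
        errors += sum(predict(model, cases[i]) != labels[i] for i in test)
    return errors / len(cases)

# Toy stand-in for a real algorithm: always predict the most common training label.
def fit_majority(train_cases, train_labels):
    return max(set(train_labels), key=train_labels.count)

def predict_majority(model, case):
    return model

# Invented data: 7 customers who stayed, 3 who left.
cases = list(range(10))
labels = ["stay"] * 7 + ["leave"] * 3
error = cross_validated_error(cases, labels, 5, fit_majority, predict_majority)
print(f"Cross-validated error rate: {error:.0%}")
```

Because the model is rebuilt once per fold, the cost grows with the number of folds, which is why these methods are processing-intensive.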

Often, the patterns in the data are obscured by multiple attributes (or columns) for each data instance that are slightly different measures of the same thing -- for example, age and date of birth. Other columns may be irrelevant.

The way the data is represented in the columns can also affect the tool's ability to find a good model. It's important for products to deal with data complexity by providing tools to guide you in selecting appropriate columns and transforming their values. Unfortunately, few products effectively help you select appropriate columns.
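One simple way a tool might flag columns that measure the same thing is to look for pairs that are almost perfectly correlated. The Python sketch below does this with the Pearson correlation coefficient; the column names, data, and 0.95 threshold are assumptions for illustration, and correlation catches only linear redundancy between numeric columns.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def redundant_pairs(columns, threshold=0.95):
    """Return column pairs whose absolute correlation is at or above the threshold."""
    names = sorted(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# Invented customer data; age and birth_year carry the same information.
columns = {
    "age": [25, 40, 31, 58, 46],
    "birth_year": [1972, 1957, 1966, 1939, 1951],
    "monthly_spend": [40, 22, 85, 30, 61],
}
print(redundant_pairs(columns))
```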

Scalability depends on how effectively the tool takes advantage of powerful hardware in order to deal with large amounts of data (both rows and columns) and with sophisticated validation techniques.

Examine the kinds of parallelism the tool supports. Does it use a parallel database management system, and are the algorithms themselves parallel? What kind of parallel computers does it support -- symmetric multiprocessing servers or massively parallel processing servers? How well does it scale as the number of processors increases? Does it support parallel data access?

It's important to recognize, however, that the size of the machine on which a product runs is not a reliable indicator of the complexity of problems it can address. Very sophisticated products that can solve complex problems run on a range of computers from a desktop PC to a large MPP system in a client-server architecture. Quite possibly all the system architecture may tell you is something about the amount of data that can be mined and the price of the product.

People Skills
Data mining products should be used only by someone who is familiar with the data, the application, and model building. Remember that skill levels vary: an analyst may know the business and how to use the tool but not be particularly knowledgeable in model building.

To facilitate model building, some products provide a graphical user interface for semiautomatic model building, while others provide a scripting language. Some products also provide data mining APIs that can be embedded in a programming language such as C, Microsoft's Visual Basic, or Powersoft's PowerBuilder.

Because of important technical decisions in preparing the data and selecting modeling strategies, even a GUI that simplifies the model building requires expertise to find the most-effective models.

A model may be deployed by running it against an existing collection of data or against new cases as they come in. Some tools provide a simple GUI for executing the model, which lets a business analyst explore the patterns. Other tools can export the model to a program or a database, for example, using a set of rules in SQL or a procedural language such as C.
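To illustrate what exporting a model as SQL rules can look like, the Python sketch below turns a tiny hand-written decision tree into a nested SQL CASE expression. The tree, column names, and class labels are hypothetical, and the output format is an assumption; each product has its own export conventions.

```python
def tree_to_sql(node):
    """Render a nested (condition, if_true, if_false) tree as a SQL CASE expression."""
    if isinstance(node, str):
        # A leaf holds the predicted class label.
        return f"'{node}'"
    condition, if_true, if_false = node
    return (
        f"CASE WHEN {condition} THEN {tree_to_sql(if_true)} "
        f"ELSE {tree_to_sql(if_false)} END"
    )

# Hypothetical attrition tree; column names are invented for illustration.
tree = (
    "months_as_customer < 6",
    ("complaints > 2", "likely to leave", "likely to stay"),
    "likely to stay",
)
sql = tree_to_sql(tree)
print(sql)
```

The resulting expression could be placed in the select list of a query so that the database itself scores new cases as they arrive.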

Basic Capabilities
To some degree, all data mining products address the basic areas that follow.

Data preparation: Preparing the data is the most time-consuming aspect of data mining. Anything a tool can do to ease this process will greatly expedite the development of your model. Some functions to look for include data cleaning, data description, data selection, and data transformation.

Data access: Some data mining tools require you to extract the data from target databases into their internal file format, whereas others can work directly against the database. A scalable data mining tool that can directly access the data mart's database server using its native SQL can maximize performance and take advantage of special server features, such as parallel database access.

Performance: Speed and accuracy both contribute to performance. Speed is measured in how fast a model is built, as well as how fast a deployed predictive model can evaluate new data.

Accuracy is measured in the error rate of the algorithm for predictive modeling. Small differences in accuracy (96% vs. 97%, for example) may not be significant. Accuracy is affected by noise, which is the result of irrelevant columns, missing or incorrect values, or cases that don't conform to the underlying pattern you're trying to find. How much of this noise can your model-building tool stand before its accuracy drops?
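Whether a one-point difference in accuracy matters depends heavily on how many test cases it was measured on. The Python sketch below gives a rough check using the normal approximation for two proportions. It assumes the two accuracies were measured on separate, independent test sets of equal size, which is a simplification; the function name and sample sizes are invented for illustration.

```python
from math import sqrt

def accuracy_gap_in_standard_errors(acc_a, acc_b, n_test):
    """Express the gap between two accuracies in units of its standard error.

    Assumes independent test sets of n_test cases each (a simplification).
    """
    standard_error = sqrt(
        acc_a * (1 - acc_a) / n_test + acc_b * (1 - acc_b) / n_test
    )
    return abs(acc_a - acc_b) / standard_error

small_test = accuracy_gap_in_standard_errors(0.96, 0.97, n_test=1_000)
large_test = accuracy_gap_in_standard_errors(0.96, 0.97, n_test=10_000)
print(f"1,000 test cases:  gap = {small_test:.2f} standard errors")
print(f"10,000 test cases: gap = {large_test:.2f} standard errors")
```

By the usual rule of thumb, a gap below about two standard errors could easily be due to chance, so the same 96% vs. 97% comparison can be inconclusive on a small test set and convincing on a large one.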

Model evaluation and interpretation: Some technologies produce results that are relatively easy to interpret. For example, decision trees can express their results as rules. Other technologies may make good predictions, but they may be difficult to understand.

Products can help the user understand the results by providing measures (for example, for accuracy and significance), by allowing the user to perform a sensitivity analysis on the result, and by presenting the result in alternative ways, such as graphically.

Interfaces to other products: Many tools, including traditional query and reporting tools, graphics and visualization tools, and online analytical processing tools, can help you both understand your data before you build your model and interpret the results after your model is deployed. Data mining software that provides an easy integration path with other vendors' products provides the user with additional ways to get the most out of the knowledge discovery process.

The Products
No matter how comprehensive the list of capabilities and features you develop for describing a data mining product, nothing substitutes for actual hands-on experience. All cars have four wheels and an engine, but they certainly feel different on the road. While feature checklists are an essential part of the purchase decision, they can only rule out certain products. Only use of a product in a pilot project can give you the information to determine the right product for you.

KnowledgeSeeker
KnowledgeSeeker from Angoss Software Corp. in Toronto is a desktop or client-server tool that uses decision trees for predictive models. A version of CHAID (Chi-squared Automatic Interaction Detection) is used to predict categorical variables, and CART (Classification and Regression Trees) is used for continuous variables.

Angoss provides a nice GUI for building the model and interactive facilities that let the user explore the data by splitting a selected node in the tree or even forcing a particular split that might be of interest. Users can also deploy the model by exporting the discovered rules as text, SQL, or Prolog. Another option is to use a C API to navigate the tree directly.

In addition to being used by itself, KnowledgeSeeker is often used to determine significant database columns that are then fed to statistical analysis tools for building models with logistic regression.

DataCruncher
DataCruncher from DataMind Corp. in San Mateo, Calif., is designed for customer attrition or churn problems, with a particular emphasis on the telecommunications industry.

DataCruncher is a client-server tool that uses a proprietary model-building technique called Agent Network Technology to build tree-like classification models on complex data sets, although it doesn't yet support parallelism.

A GUI that runs under Microsoft Excel automates much of the model-building process. DataCruncher also uses Excel graphs to help visualize the results of a model. This interface is intended to let business technologists build and refine their own models.

Red Brick Systems in Los Gatos, Calif., has integrated a version of DataCruncher into its Red Brick Warehouse database.

DataBase Mining Marksman
DataBase Mining Marksman from HNC Software Inc. in San Diego is designed for database marketing applications and is unusual because it's sold as a combination of hardware and software. The hardware component is a standard PC with an accelerator board containing 16 parallel processors, allowing Marksman to quickly and automatically build many neural nets with different architectures in order to select the best.

Marksman discovers relationships between attributes by computing relationship strengths between all pairs of fields. This is useful for exploring data and identifying highly correlated columns. It also builds classification models with feed-forward back-propagation neural networks.

Marksman provides excellent data-preparation features and extensive reporting and analysis tools to support direct marketing.

//Discovery
The //Discovery suite (pronounced "Parallel Discovery") from HyperParallel Inc. in San Francisco is a comprehensive set of data mining tools for classification, regression, clustering, association, and sequencing.

HyperParallel provides a command-line interface designed to be used only by trained HyperParallel personnel in building custom applications. Consequently, the company sells its product set bundled with its services.

The components of each new application are incorporated in an internal library, called //Solutions Framework, which provides the building blocks for future implementations of similar applications.

Version 1.0 product components include //Affinity for associations and sequencing; //Induction, a decision-tree classifier also used for value prediction; and //Cluster, a clustering tool. All components exploit parallel processing on both SMP and MPP Unix platforms.

Intelligent Miner
IBM's Intelligent Miner is a comprehensive set of data mining tools for classification, association and sequence discovery, time series, clustering, and regression. IBM supplies multiple technologies for classification (decision tree and neural net), and clustering (demographic and neural net); most of the algorithms have been parallelized for scalability.

Models can be built using either a GUI or an API. Although the first release of the GUI is a little rough, IBM has done some nice work in setting defaults. I was impressed by the preview I saw of the GUI for the next release.

Intelligent Miner is tightly coupled with DB2, which also must be installed, but Miner supports input from sources such as ASCII files.

Decision Series
Decision Series from NeoVista Solutions Inc. in Cupertino, Calif., is a comprehensive set of data mining tools that provide different models and technologies.

DecisionNet is a back-propagation neural network tool, employing a proprietary technique closely related to radial basis functions, and can be used for classification and regression. DecisionAR can be used for association and sequencing discovery. DecisionCL is used for clustering, employing a proprietary technique closely related to k-means. DecisionAccess provides data-preparation functions, and Decision Series accesses data in ASCII files.

Like HyperParallel, NeoVista offers Decision Series only in a bundle with its knowledge engineering services. Consequently, NeoVista does not yet offer a GUI for model development, although the company develops a custom GUI for most applications.

4Thought
4Thought from Cognos Inc. in Burlington, Mass., is designed for building regression and time-series models, although it also can be used for classification. It uses neural nets to build these models.

A spreadsheet-style interface reflects 4Thought's orientation toward business analysts, although they will have to be comfortable using spreadsheet functions in complex ways.

The deployment and model-analysis capabilities of 4Thought are extensive, and it does a particularly strong job of time-series analysis. Predicted values can be exported in a number of formats, and a model can also be exported as a function to Excel, Lotus 1-2-3, and the SPSS statistical product.

Data Mining Solution
Data Mining Solution from the SAS Institute Inc. in Cary, N.C., is an SAS System module for data mining analysis. SAS provides a GUI with an extensive set of options for building the model.

The current version of the Data Mining Solution includes the SAS Neural Network Application and the SAS Decision Tree Application, for building CHAID-based decision trees. Before generating the model, you can explore the data using the SAS/Insight visualization tool.

A more extensive and better-integrated version of Data Mining Solution will enter beta testing in June. This version will include association discovery. A particularly useful feature of the updated version is that it's integrated with a data mining methodology that helps users integrate the steps of knowledge discovery.

MineSet
MineSet from Silicon Graphics Inc. in Mountain View, Calif., is a set of data mining tools that combine classification and association algorithms with visualization. One of MineSet's most notable features is its integration of data mining analytic tools with high-end visualization tools for user exploration and navigation of data sets and mining results.

The product's data mining tools include the Association Rule Generator, Decision Tree Inducer for classification, Evidence Inducer, and Column Importance determination utility. All the algorithms, except association, are based on MLC++ technology developed at Stanford University.

Visualizers, such as the Map Visualizer and the Scatter Visualizer, work directly with data to explore relationships and trends. Models are presented to the user for analysis using 3-D visualization. Model results are displayed with visualizers such as the Tree Visualizer, which lets the user explore a decision tree model by virtually flying over it. The tools present several dimensions of data simultaneously by using color, size, and animation. The visualizers support filtering, querying, rotation, zooming, and panning, and can be customized by the user.

Darwin
Darwin from Thinking Machines in Bedford, Mass., is a set of tools oriented toward classification and regression.

StarTree builds decision trees using a CART-based algorithm. StarNet creates models using neural networks. Five possible training algorithms are available in StarNet: back-propagation, steepest descent, conjugate gradient, modified Newton, and a genetic algorithm called StarGene.

StarMatch produces models using the k-nearest neighbor algorithm. A fourth component, StarData, is used for data preparation and analysis. A scripting facility lets users record and play back data mining analyses.

Darwin has a GUI for building models. The algorithms are parallelized for handling large amounts of data. Darwin's use of multiple algorithms for model building broadens the range of problems for which the product can effectively build models.

Pattern Recognition Workbench
Pattern Recognition Workbench from Unica Technologies in Lincoln, Mass., is a set of tools for building classification, clustering, time-series, and regression models. In addition to building models with back-propagation neural networks, it provides algorithms for logistic regression and linear regression. It also builds clustering models using k-means.

The toolset is the basis of Model 1, a package for database marketing modeling and analysis sold by Group 1 Software and Unica. PRW provides a GUI based on a spreadsheet-style interface. The data must be brought into one or more spreadsheets, after which the data is prepared for mining with the product's extensive set of functions.

PRW will automatically generate alternative models and search for a best solution. It also provides a variety of visualization tools to monitor model building and interpret results. A model can be deployed as a spreadsheet function, as a dynamic link library, or as C code.

Herb Edelstein is president of Two Crows Corp., a data-mining consulting firm in Potomac, Md., which has produced the report "Data Mining: Products, Applications, and Technologies." He can be reached at herb@twocrows.com.

http://www.informationweek.com
