Debunking Data Mining Myths

Don't let contradictory claims about data mining keep you from improving your business

By Robert D. Small
Information Week: January 20, 1997
Copyright 1997 CMP Media, Inc.

A great deal of what is said about data mining is incomplete, exaggerated, or wrong. Data mining has taken the business world by storm, but as with many new technologies, there seems to be a direct relationship between its potential benefits and the quantity of often-contradictory claims, or myths, about its capabilities and weaknesses. It's difficult to fight these myths, which are based on misunderstandings, hopes, and fears.

The new technology cycle typically goes like this: Enthusiasm for an innovation leads to spectacular assertions. Ignorant of the technology's true capabilities, users jump in without adequate preparation or training. Then, sobering reality sets in. Finally, frustrated and unhappy, users complain about the new technology and urge a return to "business as usual."

When you undertake a data-mining project, avoid a cycle of unrealistic expectations followed by disappointment. Understand the facts instead, and your data-mining efforts will be successful. Simply put, data mining is used to discover patterns and relationships in your data in order to help you make better business decisions.

Myth: Data mining produces surprising results that will utterly transform your business.

Fact: Most often, data mining yields steady improvement to an already successful organization, contributing important incremental changes rather than revolutionary ones.

Nevertheless, data mining can lead to significant change in several ways. First, it may give the talented business manager a small advantage each year, on each project, with each customer. Compounded over a period of time, these small advantages turn into a large competitive edge. For example, a catalog retailer that can better target its mailing list can increase profits by reducing the cost of mailings while increasing the number of orders. Over time, this can result in a substantially more profitable business.

Second, data mining occasionally does uncover one of those rare "breakthrough" facts, such as scientists noticing the association between the often-fatal Reye's syndrome and children taking aspirin.

In short, data mining is a powerful search tool for forward-looking companies.

Myth: Data-mining techniques are so sophisticated that they can substitute for domain knowledge or for experience in analysis and model building.

Fact: No analysis technique can replace experience and knowledge of the business and its markets. On the contrary, data mining makes education and experience in many areas more important than ever. While experts may need to learn new analytical techniques to stay current and make leading-edge contributions, someone who's an expert only in analytical techniques, without having knowledge of the business, is of no help.

Experience in building models, however, can ensure more profitable use of data mining, since data mining is simply the newest tool for building models.

The less domain knowledge a data-mining expert brings to a problem, the more important it is to perform the data mining in close cooperation with people who understand the business.

Similarly, the less skill and experience that business experts have in modeling and using the associated tools, the more help they need from data-mining experts in leveraging their business knowledge.

For example, financial analysts seeking to increase the return on their clients' investments may ask an expert data miner to analyze a large, complex database on previous clients. The data miner may discover that certain variables predict success in investing, but it takes a financier to know whether it's legal to influence those variables.

Myth: Data-mining tools automatically find the patterns you're looking for, without being told what to do.

Fact: Data mining is most cost-effective when used to solve a particular problem. Although a data-mining tool can indeed explore your data and uncover relationships, it still needs to be directed toward a specific goal. Simply giving a data-mining tool a mailing list and expecting it to find customer profiles that improve the efficiency of a direct-mail campaign is not particularly effective. You need to be more specific in your goals. For example, to improve the value of mailing-list responses, your model might emphasize customers who have previously bought expensive items; to increase the number of responses, your model might emphasize customers who have responded to previous mailings.

Myth: Data mining is useful only in certain areas, such as marketing, sales, and fraud detection.

Fact: Virtually any process from pharmacology to customer service can be studied, understood, and improved using data mining. These techniques are being applied to such diverse applications as manufacturing process control, human resources, and food-service management.

Data mining is useful wherever data can be collected. Of course, in some instances, cost/benefit calculations might show that the time and effort of the analysis are not worth the likely return. For example, suppose you suspect that if you collected just one more piece of information about your customers, you could double the number of orders you receive. But you also know that mailing to twice as many people will double the number of orders. If gathering the data is more expensive than sending the extra mailings, then it makes sense to increase the mailings rather than mine the data.
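The comparison in this example comes down to simple arithmetic. The sketch below is only an illustration: every figure in it (extra orders, profit per order, and both costs) is a hypothetical value chosen for the example and does not come from any real campaign.

```python
def net_gain(extra_orders, profit_per_order, cost):
    """Profit from the additional orders minus what it cost to get them."""
    return extra_orders * profit_per_order - cost

# Hypothetical figures: both options are assumed to double orders,
# so each one brings in the same number of extra orders.
extra_orders = 1_000
profit_per_order = 20

data_gathering_cost = 15_000   # assumed cost of collecting the extra field
extra_mailing_cost = 8_000     # assumed cost of mailing to twice as many people

gain_from_mining = net_gain(extra_orders, profit_per_order, data_gathering_cost)
gain_from_mailing = net_gain(extra_orders, profit_per_order, extra_mailing_cost)

# With equal benefits, the cheaper option wins.
best_option = (
    "increase the mailings"
    if gain_from_mailing >= gain_from_mining
    else "mine the data"
)
print(gain_from_mining, gain_from_mailing, best_option)
```

Because the benefit is assumed to be identical, only the two costs decide the outcome. If the options produced different numbers of orders, the same function could be called with a separate order count for each.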

Myth: The methods used in data mining are fundamentally different from the older quantitative model-building techniques.

Fact: All methods now used in data mining are natural extensions and generalizations of analytical methods known for decades. Neural nets, a special case of projection pursuit regression, were developed in the 1940s. CART (classification and regression trees) methods were used by social scientists in the 1960s. K-nearest neighbor, a form of density estimation, has been used for a half-century.

All these methods--just like regression techniques--model relationships between a set of profile variables and an outcome.

What's new in data mining is that we're now applying these techniques to more general business problems, thanks to the increased availability of data and inexpensive processing power.

Furthermore, because communication between the business community and methodologists, who are mainly academics, has often been poor, there was, until recently, no user-friendly software for implementing these methods. The recent interest in data mining is in part due to the improved user interfaces that make these techniques more available to business experts.

The rise of these powerful methods is a great step forward, but the old tools are still valuable. Varieties of regression techniques, discriminant analysis, and even simple graphs can help reveal hidden patterns. No single method solves all or even a majority of problems. Successful data mining requires a portfolio of tools, both old and new.

Myth: Data mining is an extremely complex process.

Fact: The algorithms of data mining may be complex, but new tools have made those algorithms easier to apply. Often, just the correct application of relatively simple analyses, graphs, and tables can reveal a great deal about your business. Much of the difficulty in applying data mining comes from the same data-organization issues that arise when using any modeling technique. These include data preparation tasks--such as deciding which variables to include and how to encode them--and deciding how to interpret and take advantage of the results.

Myth: Only massive databases are worth mining.

Fact: It's true that many methods used in data mining were specifically developed for analyzing very large data sets, and that many data-mining applications involve massive data sets. But a moderately sized or small data set can also yield valuable information. For example, buying patterns may depend most strongly on the day of the week or the time of the year. A modest database consisting of only "day" and "sales" could show this pattern, give the retailer some idea of its magnitude, and allow for planning of inventory and staffing.
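As a rough sketch of how little is needed for the "day" and "sales" example, the following averages sales by day of the week. The records are invented for illustration, and the function name is hypothetical; a real analysis would use the retailer's own data and many more observations.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (day, sales) records; the values are made up for illustration.
records = [
    ("Mon", 100), ("Tue", 110), ("Sat", 220),
    ("Mon", 120), ("Tue", 90), ("Sat", 180),
]

def average_sales_by_day(rows):
    """Group sales figures by day and return the mean for each day."""
    by_day = defaultdict(list)
    for day, sales in rows:
        by_day[day].append(sales)
    return {day: mean(values) for day, values in by_day.items()}

averages = average_sales_by_day(records)
for day, avg in averages.items():
    print(day, avg)
```

Even a table this small shows which days sell more and by roughly how much, which is the kind of information a retailer could use for inventory and staffing.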

Even when building a massive database, try out some simple analysis on the data while the database is still moderate in size. You may decide to collect the data differently or to collect different data altogether.

Myth: Data mining is more effective with more data, so all existing data should be brought into any data-mining effort.

Fact: More data items are useful only if they contribute more information about the issues at hand, or goals. Otherwise, they can be worse than worthless. A database may have a great deal of information about an item (or about the relationship between items) but nothing about other items that are actually closely related. For example, a company may have information about how customers use one credit card, but nothing about how those customers use their other credit cards.

Adding data with little information content can actually lower the predictive power of the resulting model. Including irrelevant data, or multiple measurements of the same item, reduces the utility of the data-mining results. For example, if you include age as well as birth date, the analysis tool will discover that both factors are equally relevant and will therefore assign a lower weight to each measure as a predictor.
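The age and birth date example can be illustrated with a small sketch. It assumes one particular kind of tool: a linear model fitted by gradient descent starting from zero weights, which is only one of many possible methods, and other tools may treat redundant inputs differently. The data, the function name `fit_linear`, and the relationship "outcome = 2 x age" are all hypothetical.

```python
def fit_linear(columns, target, steps=500, rate=0.1):
    """Fit weights for a linear model (no intercept) by gradient descent from zero."""
    n = len(target)
    weights = [0.0] * len(columns)
    for _ in range(steps):
        preds = [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n)]
        errors = [p - t for p, t in zip(preds, target)]
        # All gradients use the same errors, so the update is simultaneous.
        for j, col in enumerate(columns):
            grad = sum(e * c for e, c in zip(errors, col)) / n
            weights[j] -= rate * grad
    return weights

# Hypothetical centered data: age in decades relative to the average customer.
age = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Birth date carries the same information with the opposite sign.
birth = [-a for a in age]
# Hypothetical outcome that depends only on age.
outcome = [2.0 * a for a in age]

w_age_only = fit_linear([age], outcome)
w_both = fit_linear([age, birth], outcome)
print(w_age_only, w_both)
```

With age alone, the fitted weight approaches 2. With both columns, the same effect is split between them (roughly 1 and -1), so each measure looks only half as important as age did on its own, even though no new information was added.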

Myth: Building a data-mining model on a sample of a database is ineffective, because sampling loses the information in the unused data.

Fact: The thrust of almost all developments in the study of sampling is to maximize the amount of information gained per unit of effort expended.

Keep in mind that your data probably already represents a sample of a larger population. When you analyze your customer database to help acquire new customers, you're basing your model on a sample of the total population.

Under some circumstances, you may be forced to sample. Not all your data may be relevant to the problem at hand or reflect the population you're trying to model. Many data warehouses include historical data that reflects conditions--such as unexpired patents--that no longer apply, rendering it inappropriate for building a model to guide future decisions.

Sometimes full-scale data-gathering is not practical. For example, if you'd like to learn about customers' satisfaction with your new product or service, but it takes an hour to administer a customer satisfaction survey, you'll most likely decide to limit your analysis to a sample.

In fact, a relatively small random probability sample, correctly taken, can yield excellent results. Although there are 60 million or more voters in a presidential race, the final poll before the election, which is based on two-thousandths of 1% of those voters, is seldom off by more than 2 percentage points. If we had a database of all 60 million voters and hundreds of measurements on each one, we couldn't build a better model for predicting the winner.
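The polling figures can be checked with the standard margin-of-error formula for a proportion. This is a minimal sketch that assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst-case proportion of 0.5; real polls use more elaborate designs, so the result is only a rough guide.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

voters = 60_000_000
# Two-thousandths of 1% is 2 in every 100,000 voters.
sample_size = voters * 2 // 100_000

moe = margin_of_error(sample_size)
print(sample_size, round(moe * 100, 1))
```

Under these assumptions the sample is 1,200 voters and the margin of error is a little under 3 percentage points. The formula does not involve the size of the population when the sample is a small fraction of it, which is why a much larger database of voters would add little precision for this question.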

Even when it's possible to build the model on the entire database, you may choose not to. It's often a better use of resources to build and evaluate many models using samples of the data, rather than rely on a single model using all the data.

Myth: Data mining is another fad that will soon fade, allowing us to return to standard business practice.

Fact: Although the name may change, data mining as a vital application will not go away. Companies have been using related quantitative techniques in many parts of their businesses for a long time. Data mining is just one more advance in a research process that has been ongoing since the beginning of the 20th century. A recent increase in the power of computers, coupled with cheap electronic methods for capturing large amounts of data, brings us to this step now.

Data mining can't be ignored -- the data is there, the methods are numerous, and the advantages that knowledge discovery brings to a business are tremendous. Companies whose data-mining efforts are guided by "mythology" will find themselves at a serious competitive disadvantage to those organizations taking a measured, rational approach based on facts.

Robert D. Small was VP of Research of Two Crows Corp. in Potomac, Maryland.

http://www.informationweek.com
