Two Crows Corp. logo

 Home  |  About Data Mining  |  Publications  |  Seminars  |  Consulting  |  About Two Crows 

Scalable Data Mining

by
Robert D. Small, Ph.D.
Herbert A. Edelstein

 

Download this paper as a MS Word file or as a Postscript file.

Table of Contents

  1. Introduction
  2. Users Need Scalable Data Mining
  3. Technical Factors Requiring Use of Scalable Data Mining Tools
    1. Database Size
    2. Model-Building Complexity, Performance Monitoring and Tuning
    3. Training and Validation
  4. Making Data Mining Tools Scalable
    1. Hardware Scalability
    2. Parallel Algorithms
    3. Native Database Access and Parallel DBMSs
  5. Conclusion
  6. Scalable Data Mining Requirements
  7. Scalable Data Mining Tools

I. Introduction

Data mining can yield exciting results for almost every organization that collects data on its customers, markets, products or processes. By discovering hidden patterns and relationships in the data, data mining enables users to extract greater value from their data than simple query and analysis approaches.

To discover the hidden patterns in data, we need to build a model consisting of independent variables (e.g., income, marital status) that can be used to determine a dependent variable (e.g., credit risk). Building a data mining model consists of identifying the relevant independent variables and minimizing the predictive error. To identify the model that has the least error and is the best predictor may require building hundreds of models in order to select the best one.

Fortunately, we have reached a point in terms of computational power, storage capacity and cost that enables us to gather, analyze and mine unprecedented amounts of data. While some data mining problems are amenable to a desktop or simple client-server approach, many problems -- due to their size or complexity -- require a scalable data mining product.

Literally, scalability means that as a system gets larger, its performance improves correspondingly. For data mining, scalability means that by taking advantage of parallel database management systems and additional CPUs, you can solve a wide range of problems without needing to change your underlying data mining environment. You can work with more data, build more models, and improve their accuracy by simply adding additional CPUs. Ideally, scalability should be linear or better. For example, if you double the number of CPUs in a parallel system, you can build twice as many models in the same amount of time, or the same number of models in half the time.

The rest of this white paper will examine in detail the reasons scalable data mining solutions are needed and the main methods of achieving scalability. While our discussion will be based on the problem of predicting which groups or classes a case (or row of data) falls into, the need for scalability covers all categories of data mining. We'll make extensive use of real examples drawn from a variety of organizations to illustrate the breadth of applications to which scalable data mining may be applied as well as the universality of the issues.

Top of page

II. Users Need Scalable Data Mining

Fundamentally, the reason for scalable solutions is to be able to build a good data mining model as quickly as possible. This offers two benefits.

First, there is value in being able to deploy and use the model sooner rather than later. In the second example below, finding a good model will immediately increase the payoff of the product promotions included with bills. The credit card vendor doesn't want to let many billing cycles pass without taking advantage of the data mining results.

Second, faster turnaround times yield better models. As we shall see, mining large databases and constructing complex models call for lots of computing power, as do sampling, testing and validating. If analysts have to wait hours or overnight for a model to run, their effort is going to be very different than if they can get a model trained in a few minutes. Time previously spent waiting for results can instead be devoted to finding the model that results in the best, most reliable solution.

For example, if in a few minutes analysts can create a group of models and graph their forecasts, they can pick from a large number of model possibilities. If they have to wait overnight for each model, they will take steps to reduce the number of models investigated, such as studying smaller sampled databases and including fewer columns. Not surprisingly, their results will be different and quite likely not as useful.

While model building can be accelerated through the use of judicious sampling, such sampling -- especially if complex (e.g., stratified or clustered) -- itself requires substantial power. In the end, nothing is more useful than adequate computer power and nothing so changes the analyst's approach as inadequate power.

Example: The marketing department of a research-intensive company decides to investigate the possibility of using data mining techniques on several of their databases. They acquire some data mining software and request that some of the statisticians in research assist them as analytic experts. The statisticians have work stations that are excellent for the analyses they usually do on modest size research databases. Typically, they get immediate response to virtually any request for analysis or processing or graphical displays. The statisticians are initially excited by the novelty of the data mining assignment and dealing with huge, complex databases. Their interest quickly wanes, however, when faced with response times in excess of a few hours. They report that data mining methods are cumbersome and useless, and they do not find patterns in the data. They recommend that the effort be terminated.

If these statisticians had had a data mining tool that scaled to handle the larger amount of data they were now using, on appropriately sized parallel hardware, they would have stayed enthusiastic about data mining.

The size of a database is not the only reason to use scalable tools; other factors in building, testing, and deploying a data mining solution may come into play. In the example below, we see how within a single database, one problem dealing with a large portion of the data requires a scalable solution because of the number of cases, while the complexity of another portion leads to the requirement for a scalable solution.

Example: A credit card vendor includes a product offer in each month's statement. It would like to mine its extensive customer database to increase the number of people who take advantage of the monthly offer. Relatively simple preliminary analysis shows that the card holders split into two groups with different buying patterns.

The first group is relatively large and responds to moderately priced offers at a low rate. Because the group is so large, a modest percentage increase in the response rate will easily pay for the data mining effort. Also, the effort promises to be a relatively simple exercise since no more than five variables account for a large part of the differences between purchasers and non-purchasers. Although there are a huge number of customers, only a few columns will be analyzed, simplifying the model building. Scalability means the model may be created quickly despite the large number of rows.

The second group presents a different challenge. While small relative to the larger group, it is still of substantial size. The response rate in this group is very low but the items purchased are expensive. Therefore, even a modest increase in response rate will have a large payoff, justifying the effort required to mine this data. Unfortunately, to determine what is unique about this second group of purchasers will require all of the data available on each customer and several complex models. For this problem, scalability permits building a large number of models to adequately explore the data and find the unique characteristics of this second group.

It is important to remember that the amount of data you are going to mine is likely to increase over time. Furthermore, other problems to which you wish to apply data mining may be larger or more complex than your current problem. Thus a system that is scalable can not only meet your current requirements, but grow with them. In some cases, it is difficult to precisely know your requirements before you get well into the application. This will not be a problem with a scalable tool because it can accommodate a range of requirements.

Top of page

III. Technical Factors Requiring Use of Scalable Data Mining Tools

1. Database size

In finding the patterns hidden within a database, you may need to sift enormous amounts of data. The optimum platform for accomplishing this is to run parallel data mining algorithms and parallel DBMSs on parallel hardware.

The following example illustrates how the size of a database to be investigated can quickly grow. It also shows how mining your data may require many different models.

Example: A chemical manufacturing plant is not satisfied with the yield or quality of a product and wants to investigate the various factors that might affect these two variables. An online database stores variables such as temperature, pressure and time for each six-hour run. These observations are taken at one-second intervals. The plant runs seven days a week, twenty-four hours a day, with one batch run on each shift. Thus, with over 20,000 observations per shift, there is a tremendous amount of data.

The chemists and chemical engineers identify several variables they are certain affect the yield and quality. They speculate that other environmental conditions such as ambient humidity may affect results. A decision will have to be made if such items are to be investigated, since that data will have to come from an outside source such as the weather bureau.

They define their problem as lower than acceptable product yield and quality. They decide that a minimal 20% improvement in yield and a 10% improvement in quality is an acceptable goal for their data mining project. They list all the variables they will investigate and list what models might be useful to them. They write all of this information in a protocol that guides their investigations over the coming weeks.

Data analysis approaches and computer storage and processing requirements are affected by database size. A useful measure of database size is the number of rows and columns. Rows refers to the number of units or cases -- customers, transactions, products, patients, items on an assembly line. Columns refers to the number of different measurements taken on each item. For example, a mail-order catalog house transaction would be a row, and the columns might include customer information, catalog numbers of items ordered, their price, the credit card used, whether the customer wanted expedited delivery, and so forth.

Rows and columns are a better indication of the amount of time it will take to build a model than simply the aggregate number of bytes. When the number of rows in the database is large, either all of the rows will need to be processed or a random sample of the rows will need to be taken. In either event, the amount of time to build a model will be greater than if there is a smaller number of rows which can be quickly processed.

The attributes of the columns can also add to size. For example, a categorical variable such as GENDER has only two values, but one such as STATE may have fifty-one or more values. Some algorithms, such as neural nets, prefer to have categorical variables split up into multiple columns so that each value becomes a column with a value of Yes or No. Thus STATE might be broken into fifty-one columns, with Alaska having a value of Yes or No, and so on. Not only does the size of the database swell, but the amount of searching that the algorithm must do goes up dramatically.

The number of columns and the attributes of those columns will also influence the complexity of the data mining model, and hence the time it takes to build.

Data subsets and samples

In almost all data mining applications the data is divided into a number of subsets, or samples. The sample might be from a database that's too large to process in full, for example, or it may have been selected for a pilot. Sample selection must be done carefully to ensure the sample is random.

Training and testing the data mining model will also require sampling the data to split it into at least two groups: one for model training (i.e., estimation of the model parameters) and one for model testing. If you don't use different training and test data, the accuracy of the model will be overestimated. After the model is generated using the training database, it is used to predict the test database, and the resulting error rate is a good estimate of how the model will perform on future databases. There are other more complicated ways to estimate error, but, as we shall see, they also involve sampling the data in a random way.

For most problems, an appropriate sample of ten percent of the data or even less can give excellent results when training a model. Straightforward calculations show that for most business problems there is virtually no loss in information when this is done. Numerous models could be investigated on a small sample of the data and only the best few be trained on the whole database. Though such approaches do not eliminate the need for powerful computing, they can speed up the model effort.

There are times when business requirements necessitate sampling. For example, suppose that every record in the analysis set had to be reviewed by a panel of experts to determine the value of an unrecorded data item. This is often the case with exact cause of death in epidemiological studies. To have such a panel of experts review 100,000 records might take months or years, but a 10% sample may be done in an acceptable time.

Example: A software vendor develops a model that can be used to control the settings of a vat in a chemical plant to increase yield. Once every three minutes, the model uses thirteen characteristics of the raw materials and readings of five vat conditions and three ambient conditions as inputs. It then outputs values for three vat conditions that are used to regulate the process in real time.

After pilot projects at three different plants the vendor begins to sell the software model and a computer that accepts the readings, trains the model and automatically sets the vat conditions. The company finds that one group is very happy with the unit while the other group is so upset with performance that they have disconnected the unit and returned to manual setting.

The vendor finds that the complaints all come from companies who have new plants or have had new equipment recently installed. The new equipment takes a reading every second instead of every three minutes. The computer is much slower processing 180 times as much data, and in fact, delivers forecasts for what was happening two minutes earlier. This often leads to unstable feedback that results in lost batches.

To respond to the unhappy chemical manufacturers, the vendor puts together a group of three analysts to study the problem and implement a solution. They note that the conditions in the vat do not change much in a second or even in a minute. The device that records the data every second is a bit of technological excess. They write a short piece of software that takes a systematic sample of one point in every sixty, which they use as the input to the computer. Although three times as much data is entered as before, the model can produce forecasts quickly enough.

Had they written their software to take advantage of parallel computing, they would have been able to solve the problem by simply adding CPUs.

There are also times when sampling is not appropriate. This is particularly true when we are trying to discover subtle effects that may be exhibited by only a small portion of the population. In such a case, sampling can lose the information necessary to find the pattern for which we are searching.

Deployment

Even after the data mining model is built, there may be a need for parallel computing to apply the model. In some situations, the data mining model is applied to one event or transaction at a time, such as scoring a loan application for risk. The amount of time to process each new transaction, and the rate at which new transactions arrive, will determine whether a scalable algorithm is needed. Thus, while the loan applications can probably be easily evaluated on modest sized computers, monitoring credit card transactions or cellular telephone calls for fraud would require a scalable system to deal with the high transaction rate.

Often a data mining model is applied to a batch of data such as an existing customer database, a newly purchased mailing list or a monthly record of transactions from a retail store. In this case, the large quantity of data to be processed would also require that a scalable solution be deployed.

Top of page

2. Model-building complexity, performance monitoring and tuning

Building a complex model with many variables, even when the number of rows is small, may be computationally intensive and will therefore benefit from parallel data mining algorithms. In all data mining applications, extra variables can introduce noise and reduce the accuracy of your model. At the same time, you don't want to leave out important variables that can increase your accuracy or simplify the application of the model to new data. Consequently, you often need to do additional analysis and build different models to ensure that you are using the right columns.

Furthermore, searching for the best model may require building and testing many different models, sometimes numbering in the hundreds, before the best solution can be found. For example, neural nets are a popular data mining technique that often require building many different models with slightly different architectures in order to find the one with the lowest error rate. Not only must you often build multiple models, but you need to do it with different algorithms. Not all algorithms are suitable for every purpose nor can each find all the patterns for which you are searching. This can further extend the length of time it takes to find the right model.

Another problem arises from the fact that the increase in time for some data analytic techniques is worse than linear. For example, the time to complete a simple linear regression increases approximately linearly for the number of observations (i.e., rows), but increases approximately with the square of the number of columns in the model. Thus, fitting six columns will take four times as long as fitting three columns. Similar calculations apply to neural nets, k-nearest neighbor methods, most time series methods and any regression technique.

Though all data mining techniques are computer-intensive, some are particularly so. For example, for k-nearest nearest neighbor the calculation time increases as the factorial of the total number of points, and the calculation must be made every time the model is used. If each model you build requires even an hour, let alone a day, you will be unable to test all the variations you would like to.

Clearly, if it takes a long time to build each model, or the number of models you must build to find the one you want is large, the only way to effectively mine your data is with a scalable tool.

Even when you think you are finished because your model works well, you must continually monitor the performance of the model. Over time, all systems evolve and the data they produce changes. Salespeople know that purchasing patterns change over time. External variables such as inflation rate may change enough to alter the way people behave and the factors that influence that behavior. Even precise manufacturing systems age and cause drift in the data they produce. Thus, from time to time, the model will have to be retrained and possibly completely rebuilt. Control charts of the residual differences between the forecasted values and the observed variable are an excellent way to monitor model results. Control charts are easy to use and understand, not computationally intensive and could be built into the software that implements the model. Thus, the system could monitor itself.

For many data mining applications, you are never completely done.

Example: A financial institution wants to reduce the number of delinquent and fraudulent transactions it processes. The managers believe that most delinquencies and some of the fraud can be relatively easily detected by checking a few variables associated with each transaction. On the other hand, there is a class of fraudulent transactions carried out by knowledgeable individuals that are detectable but much more subtle.

The staff meets with a data mining consultant and financial modeler to describe the situation and available data. The consultants suggest that a relatively simple logistic regression model effort using only three predictors can handle delinquencies. The forecasted error rate is only 3%.

The fraudulent model is much more complicated and a neural net proves to be the best possibility. The model detects fifty-two percent of the fraud cases. To improve the number of frauds caught, the consultants suggest a policy change. In six months, the model will be re-evaluated in light of the policy change.

Top of page

3. Training and validation

Training and validating a model may require large amounts of data manipulation, and up to thousands of training sessions. Data mining error rates must be determined by testing rather than calculation, and all the methods used are computing-intensive. Newer techniques are likely to be even more so. As you will see, in applying these testing methods scalability is a key to efficient model development.

A model built by any technique needs to be validated, which means calculating an error rate based on data independent of that used to estimate the model. This exercise gives a statistically valid estimate of the true error rate that the modeling procedure produces. It does not guarantee that the model is correct in any way. It simply says that if the same technique were used on a succession of databases to build a model, the average error rate would be close to the one obtained this way.

Because data-mining model building is so complex, you cannot simply calculate the error rate. Instead the model needs to be tested against other data. The testing may reflect the complexity of the model being developed. One of the ironies of data mining is that complex testing methodologies are required when you have only a few cases or samples on which to build the model. As we shall see, complex testing schemes make heavy use of the computer and are one of the reasons scalability is required.

The most basic testing method is called simple validation. To carry this out, we set aside a percentage of our database and do not use it in any way in the model building and estimation. This percentage is usually small, perhaps 5% to 20% , but there is no fixed rule. For a very large database, we may set aside 50% or more of the data. For all the future calculations to be correct, the division of the data into two groups must be random.

After building and estimating our model on the main body of the data, we use the model to predict the classes of the validation database. We then count up the number of correct and incorrect classifications and calculate an error rate.

In building a single model, even this simple validation may need to be performed dozens of times. For example, when using a neural net, each training pass through the net is tested against a test database. Training stops when the error rates on the test database no longer improve with additional iterations. In such a case, being able to perform the training and testing cycle using many CPUs instead of just one will help you arrive at a model sooner.

Even small databases present challenges that require major computing power. In a small database, setting aside some data for testing and validation makes the estimation less precise. To address this situation, cross validation was developed. In this method, we randomly divide the data into two equal sets. We estimate on the first set and predict the second. But since the data were randomly divided, we can reverse the situation and estimate on the second database and predict the first. We now have two independent (because we randomly divided) estimates of error rates. We can take the average of the two and get a better estimate of the true error rate.

In cross validation we have estimated two models. Which do we use? Neither. Instead we generate a third model using all the data.The error rate for the model using all of the data is the average of the error rates for the two halves. Thus we have a model using all of the data for estimation and all of the data for error calculation.

In practice, a database is usually divided into more than just two groups to improve error estimation, and the more general n-fold cross validation is used. In this method, we randomly divide the data into n disjoint groups. For example, suppose we divided the data into ten groups. The first group is set aside for prediction and the other nine are lumped together for estimation. The model built and estimated on the 90% group is then used to predict the group that was set aside. This process is repeated a total of 10 times as each group in turn is set aside and the model built and estimated on the remaining 90% of the data. The model is used to predict the set aside group each time. Finally, we take the mean of the 10 independent error rate predictions as our error rate.

Clearly these more complex methods require much more computing. In ordinary cross validation, we have to estimate three models and forecast on two sets of data. The amount of computing required by most models is governed more by the number of variables in the model than by the number of observations. Therefore, the model estimated on each half of the data requires more than half the computing of fitting one model to all of the data. With the fit to all of the data, cross validation requires at least three times the computing as does simple validation. N-fold cross validation, on the other hand, requires at least n times as much computing as simple validation. Since n is usually around 10, this means an order of magnitude increase in computation. The increase in accuracy and precision is usually worth the effort, however.

Example: Some physicians studying a large epidemiological database have developed a model for predicting the eventual expression of a debilitating disease in patients. Though the model fits the data well, a correct error-rate calculation is crucial. If the patient is likely to develop the disease, he must submit to a strict, expensive and in part unpleasant, dietary and medicinal regime. So it is important to minimize false positives. If the patient does not submit to the regime and later develops the disease, there is no later treatment and he is doomed. So it is important to know the false negative rate.

Although the database is large, the physicians use n-fold cross validation to split the data into ten groups, repeatedly estimate the model on nine tenths of the data and predict the final tenth. In this way, they use all of the data for both training and testing their model. Based on the error rate calculation, they decide that on close calls (i.e., patients who are not firmly in one of the two groups) additional blood and genetic tests be carried out. They build a more complicated model for such patients and lower the overall error rate to an acceptable level.

The next method, bootstrapping, requires even more computing to validate a model than does n-fold cross validation.

Briefly, when applying the bootstrapping algorithm you take a random sample, with replacement, of the data of size equal to the whole database. In other words, if you had 1,000 rows of data, you would build a sample of 1,000 rows by picking one row at random. After putting it back, you would picking another row, put it back, and keep on doing this until you had a total of 1,000 rows in your sample. Note that this means that some rows may appear more than once and some may not appear at all. You split the sample into two groups: one for training and one for validation. You repeat this cycle of model building at least 200 times - and in some cases up to 10,000 times. The mean error rate is the mean over all of the repetitions (bootstraps). By building enough models, you will converge on a very low error rate.

The bootstrap is extremely powerful as it gives not only the error, but it also gives the variance of the error as the variance over all of the bootstraps. This is an unbiased and usually efficient (lowest possible variance) estimate of the variance of the error rate.

Because we are going through the training-test cycle conceivably thousands of times, clearly bootstrapping is computationally intensive. Furthermore, the large number of samples may put an I/O burden on the computer. Only with modern parallel hardware and algorithms can we hope to use bootstrapping effectively.

Example: In a manufacturing process, the only way to absolutely determine if an item is defective is to test it to destruction. Based on some test data, a model has been developed that will predict the status of an item without destructive testing. Only a few variables are used in the model. Although the form of the model remains constant, the parameters change with each new shipment of raw material. The model is sensitive to slight changes in some of the parameters, so it is important to use as much data as possible to estimate these parameters. However, since the raw material is expensive and testing of the items is destructive, the plant wants to use as little material as possible for testing.

A data mining consultant recommends a technique called bootstrapping to estimate and validate the model. This method, which attempts to use every possible split of the data by repeatedly taking samples from the data, is computer intensive and must be run on a moderate sized computer overnight. For faster model building, or to handle more data, they will need to move to a parallel computer.

Top of page

IV. Making Data Mining Tools Scalable

1. Hardware scalability

We have talked about the need for parallel hardware for data mining algorithms and database access. As we have seen, many data mining problems involve large, complex databases, complicated modeling techniques and substantial computer processing. In addition to the actual computation, a great deal of processing is needed to select and reformat data for the data mining database. Various selection and subsetting functions are needed, and display functions are used to view data and results. All of these operations take considerable power for databases that store millions or possibly tens of millions of rows. Without adequate processing power, a large or complex data mining effort will come to a halt.

The goal of hardware scalability is to provide high performance by adding modestly priced processor building blocks (or nodes) in such a way that performance scales linearly. For example, by doubling the number of processors you can process twice as much data in the same time or the same amount of data in half the time. The hardware vendors have taken two complementary approaches of using multiple CPUs: symmetric multi-processing (SMP) and massively parallel processing (MPP).

It is important to understand the different routes to achieving hardware parallelism as they are most frequently used in data mining. A company beginning a data mining effort has to decide whether its computing facilities are adequate, and if not, what steps to take.

Example: A small trucking company decided to mine their databases to explore scheduling their delivery trucks more efficiently. Their corporate computer, though perfectly adequate for the existing transactional databases, proves totally inadequate when analysts attempt to merge them into one large analysis database. Furthermore, the data analysis and data mining applications make it impossible for the corporate computer to maintain its level of transaction throughput.

The company has to off-load its databases to a data warehouse running on an SMP parallel computer with a parallel DBMS optimized for decision support and data mining. The analysts then work on the data warehouse computer rather then on the corporate transaction computer. As the database size grows, they will add additional processing units. At some point, they may need to consider migrating to an MPP architecture.

In an SMP computer, there is a collection of processing nodes, each with its own private local cache memory. However, there is only a single instance of the operating system, and the disks and main memory are shared through a common bus. The amount of time for any processor to access main memory is equal. A major appeal of SMP is that it is relatively easy to administer, in part because it looks like one computer with a single operating system. It can be used by a DBMS with little if any reprogramming, simply by assigning tasks to CPUs, although modifications such as multithreading based on small grained, light weight threads can let the DBMS take better advantage of SMP.

Even greater degrees of scalability can be achieved by moving to one of a variety of MPP architectures. For example, in the so-called "shared nothing" MPP architecture, each node has its own main memory and I/O subsystem, but shares a backplane over which data or requests for services (called functions) can be shipped. Each node in a shared nothing system can be an SMP node, and these can be coupled together to bring hundreds of CPUs to bear on a problem. These are sometimes called SMP clusters, and should not be confused with clustered disk access systems in which a group of CPUs share an I/O system.

Top of page

2. Parallel algorithms

Data mining algorithms written for a uniprocessor machine won't automatically run faster on a parallel machine; they must be rewritten to take advantage of the parallel processors. There are two basic ways of accomplishing this. In the first method, independent pieces of the application are assigned to different processors. The more processors, the more pieces can be executed without reducing throughput. This is called inter-model parallelism. This kind of scale-up is also useful in building multiple independent models. For example, a neural net application could build multiple models using different architectures (e.g., with a different number of nodes or hidden layers) simultaneously on each processor.

But what happens if building each model takes a long time? We then need to break this model into tasks, execute those tasks on separate processors, and recombine them for the answer. This second method is called intra-model parallelism.

Top of page

3. Native database access and parallel DBMSs

As we saw earlier, large data mining problems can require large amounts of database access. This requirement can come from such operations as mining a large amount of data, needing to sample a large database, building many different models, or needing to retrieve many samples when testing and validating the model

If the data from an existing database can be mined, then native DBMS access will provide the best possible performance. Most data warehouses today make use of parallel database management systems. However, because the data mining process may involve data from multiple databases, or subsets of data from a data warehouse, it is usually a good idea to create a data mart for purposes of mining that data. This avoids the problem of consolidating data every time you want to use it.

While in some cases the data mart may use a special-purpose DBMS developed for the data mining tool, in most cases it will use a standard DBMS. In either case, the size of the data mart or allowing for growth in the size of the data mart will require the data store to be parallelized.

A scalable data mining tool should be able to directly access the data mart DBMS, rather than require that the data to be mined has been written to a flat file. Ideally, the tool uses the native SQL of the database server, which maximizes performance and takes advantage of individual server features such as stored procedures and SQL extensions. No single product, however, can support the large variety of database servers, so a gateway must be used for all but the four or five leading DBMSs. The most common gateway supported is Microsoft's ODBC (Open Database Connectivity).

A parallel DBMS can assign individual SQL statements to CPUs. This is called inter-query parallelism. Alternatively, it may break up a large SQL statement and assign pieces of it to different CPUs. This is called intra-query parallelism. Both of these approaches are analogous to how data mining models are parallelized.

If there is more than one data stream, it's possible for some operations to proceed simultaneously. For example, a customer database being mined could be spread across multiple disks, and a thread could process each subset of the customer data. This is called partitioned parallelism. Because data can be partitioned into many subsets, this technique provides a much greater opportunity for performance improvement.

All of these types of parallel processing can happen in either SMP or shared-nothing MPP architectures. The algorithms for good SMP performance, however, are different than those for good shared nothing performance.

A good partitioning scheme is an essential part of designing a database that will benefit from parallelism. Partitioned data benefits from indexed access just like non-parallel databases do. The indexes may also be partitioned. To get the best results of partitioning, the allocation scheme must be carefully designed to minimize skew and data movement. The choices will largely be determined by the queries run against the database, and the inevitable tradeoffs will mean some queries will run faster than others. In particular, any partitioning scheme can run afoul of unanticipated queries, and the parallel optimizer needs to be able to deal with these variations. A good parallel optimizer has a host of techniques to minimize the need for data shipping, but, even in a shared nothing architecture, sometimes considerable amounts of data will need to be shipped. There are so many variables among the hardware configurations and data partitioning that it is impossible to draw conclusions from vendor benchmarks.

Top of page

V. Conclusion

Data mining is an important method for extracting valuable information from all sizes of databases: large and small. Data miners need fast response time when building their models so they can construct the most effective model. Three factors, however, make developing a data mining model a potentially lengthy process:

  1. The enormous amount of data that must be processed when mining or sampling massive databases.
  2. The large number of models that must be built to explore complex databases.
  3. The intricacies of testing and validating models.

The solution to these large-scale problems is through applying parallel technology. Scalable data mining tools take advantage of hardware parallelism, including parallel algorithms as well as direct access to parallel DBMSs.

Top of page


Scalable Data Mining Requirements

The following is a list of items to consider when evaluating the need for scalable data mining solutions.

  1. Database size
    • Number of tables
    • Required storage in bytes
    • Number of rows in each table
    • Number of columns in each table
    • Number of samples
    • Type of samples (e.g., stratified, clustered)
    • Model deployment
    • Arrival rate of new transactions
    • Processing time per transaction
    • Response time required
    • Size of batches to be processed
  2. Model complexity
    • Number of columns
    • Data types for columns (continuous, categorical)
    • Categories of problems to be solved
    • Classification
    • Clustering
    • Association
    • Sequences
    • Value forecasting
    • Other
    • Types of algorithms
    • Decision trees
    • Neural nets
    • k-nearest neighbor
    • Memory based reasoning
    • Other
  3. Training and testing
    • Simple validation
    • N-fold cross validation
    • Bootstrapping

Top of page

Scalable Data Mining Tools

The following list summarizes those features of a data mining tool that make it scalable.

  1. Parallel algorithms
    • Inter-model parallelism
    • Intra-model parallelism
  2. Database processing
    • Sequential file
    • Native database access
    • DB2
    • Informix
    • Oracle
    • Sybase
    • Other
    • ODBC
    • Support of parallel DBMSs
  3. Hardware support
    • SMP computers
    • MPP computers

Top of page

© 1997 by Two Crows Corp. Two Crows and the Two Crows Logo are trademarks of Two Crows Corp. All other logos, trademarks and corporate names are trademarks of their respective owners.
 Home  |  About Data Mining  |  Publications  |  Seminars  |  Consulting |  About Two Crows