
| Home | | | About Data Mining | | | Publications | | | Seminars | | | Consulting | | | About Two Crows |
|
Table of Contents |
I. IntroductionData mining can yield exciting results for almost every organization that collects data on its customers, markets, products or processes. By discovering hidden patterns and relationships in the data, data mining enables users to extract greater value from their data than simple query and analysis approaches. To discover the hidden patterns in data, we need to build a model consisting of independent variables (e.g., income, marital status) that can be used to determine a dependent variable (e.g., credit risk). Building a data mining model consists of identifying the relevant independent variables and minimizing the predictive error. To identify the model that has the least error and is the best predictor may require building hundreds of models in order to select the best one. Fortunately, we have reached a point in terms of computational power, storage capacity and cost that enables us to gather, analyze and mine unprecedented amounts of data. While some data mining problems are amenable to a desktop or simple client-server approach, many problems -- due to their size or complexity -- require a scalable data mining product. Literally, scalability means that as a system gets larger, its performance improves correspondingly. For data mining, scalability means that by taking advantage of parallel database management systems and additional CPUs, you can solve a wide range of problems without needing to change your underlying data mining environment. You can work with more data, build more models, and improve their accuracy by simply adding additional CPUs. Ideally, scalability should be linear or better. For example, if you double the number of CPUs in a parallel system, you can build twice as many models in the same amount of time, or the same number of models in half the time. The rest of this white paper will examine in detail the reasons scalable data mining solutions are needed and the main methods of achieving scalability. While our discussion will be based on the problem of predicting which groups or classes a case (or row of data) falls into, the need for scalability covers all categories of data mining. We'll make extensive use of real examples drawn from a variety of organizations to illustrate the breadth of applications to which scalable data mining may be applied as well as the universality of the issues. II. Users Need Scalable Data MiningFundamentally, the reason for scalable solutions is to be able to build a good data mining model as quickly as possible. This offers two benefits. First, there is value in being able to deploy and use the model sooner rather than later. In the second example below, finding a good model will immediately increase the payoff of the product promotions included with bills. The credit card vendor doesn't want to let many billing cycles pass without taking advantage of the data mining results. Second, faster turnaround times yield better models. As we shall see, mining large databases and constructing complex models call for lots of computing power, as do sampling, testing and validating. If analysts have to wait hours or overnight for a model to run, their effort is going to be very different than if they can get a model trained in a few minutes. Time previously spent waiting for results can instead be devoted to finding the model that results in the best, most reliable solution. For example, if in a few minutes analysts can create a group of models and graph their forecasts, they can pick from a large number of model possibilities. If they have to wait overnight for each model, they will take steps to reduce the number of models investigated, such as studying smaller sampled databases and including fewer columns. Not surprisingly, their results will be different and quite likely not as useful. While model building can be accelerated through the use of judicious sampling, such sampling -- especially if complex (e.g., stratified or clustered) -- itself requires substantial power. In the end, nothing is more useful than adequate computer power and nothing so changes the analyst's approach as inadequate power.
If these statisticians had had a data mining tool that scaled to handle the larger amount of data they were now using, on appropriately sized parallel hardware, they would have stayed enthusiastic about data mining. The size of a database is not the only reason to use scalable tools; other factors in building, testing, and deploying a data mining solution may come into play. In the example below, we see how within a single database, one problem dealing with a large portion of the data requires a scalable solution because of the number of cases, while the complexity of another portion leads to the requirement for a scalable solution.
It is important to remember that the amount of data you are going to mine is likely to increase over time. Furthermore, other problems to which you wish to apply data mining may be larger or more complex than your current problem. Thus a system that is scalable can not only meet your current requirements, but grow with them. In some cases, it is difficult to precisely know your requirements before you get well into the application. This will not be a problem with a scalable tool because it can accommodate a range of requirements. III. Technical Factors Requiring Use of Scalable Data Mining Tools1. Database sizeIn finding the patterns hidden within a database, you may need to sift enormous amounts of data. The optimum platform for accomplishing this is to run parallel data mining algorithms and parallel DBMSs on parallel hardware. The following example illustrates how the size of a database to be investigated can quickly grow. It also shows how mining your data may require many different models.
Data analysis approaches and computer storage and processing requirements are affected by database size. A useful measure of database size is the number of rows and columns. Rows refers to the number of units or cases -- customers, transactions, products, patients, items on an assembly line. Columns refers to the number of different measurements taken on each item. For example, a mail-order catalog house transaction would be a row, and the columns might include customer information, catalog numbers of items ordered, their price, the credit card used, whether the customer wanted expedited delivery, and so forth. Rows and columns are a better indication of the amount of time it will take to build a model than simply the aggregate number of bytes. When the number of rows in the database is large, either all of the rows will need to be processed or a random sample of the rows will need to be taken. In either event, the amount of time to build a model will be greater than if there is a smaller number of rows which can be quickly processed. The attributes of the columns can also add to size. For example, a categorical variable such as GENDER has only two values, but one such as STATE may have fifty-one or more values. Some algorithms, such as neural nets, prefer to have categorical variables split up into multiple columns so that each value becomes a column with a value of Yes or No. Thus STATE might be broken into fifty-one columns, with Alaska having a value of Yes or No, and so on. Not only does the size of the database swell, but the amount of searching that the algorithm must do goes up dramatically. The number of columns and the attributes of those columns will also influence the complexity of the data mining model, and hence the time it takes to build. Data subsets and samples In almost all data mining applications the data is divided into a number of subsets, or samples. The sample might be from a database that's too large to process in full, for example, or it may have been selected for a pilot. Sample selection must be done carefully to ensure the sample is random. Training and testing the data mining model will also require sampling the data to split it into at least two groups: one for model training (i.e., estimation of the model parameters) and one for model testing. If you don't use different training and test data, the accuracy of the model will be overestimated. After the model is generated using the training database, it is used to predict the test database, and the resulting error rate is a good estimate of how the model will perform on future databases. There are other more complicated ways to estimate error, but, as we shall see, they also involve sampling the data in a random way. For most problems, an appropriate sample of ten percent of the data or even less can give excellent results when training a model. Straightforward calculations show that for most business problems there is virtually no loss in information when this is done. Numerous models could be investigated on a small sample of the data and only the best few be trained on the whole database. Though such approaches do not eliminate the need for powerful computing, they can speed up the model effort. There are times when business requirements necessitate sampling. For example, suppose that every record in the analysis set had to be reviewed by a panel of experts to determine the value of an unrecorded data item. This is often the case with exact cause of death in epidemiological studies. To have such a panel of experts review 100,000 records might take months or years, but a 10% sample may be done in an acceptable time.
There are also times when sampling is not appropriate. This is particularly true when we are trying to discover subtle effects that may be exhibited by only a small portion of the population. In such a case, sampling can lose the information necessary to find the pattern for which we are searching. Deployment Even after the data mining model is built, there may be a need for parallel computing to apply the model. In some situations, the data mining model is applied to one event or transaction at a time, such as scoring a loan application for risk. The amount of time to process each new transaction, and the rate at which new transactions arrive, will determine whether a scalable algorithm is needed. Thus, while the loan applications can probably be easily evaluated on modest sized computers, monitoring credit card transactions or cellular telephone calls for fraud would require a scalable system to deal with the high transaction rate. Often a data mining model is applied to a batch of data such as an existing customer database, a newly purchased mailing list or a monthly record of transactions from a retail store. In this case, the large quantity of data to be processed would also require that a scalable solution be deployed. 2. Model-building complexity, performance monitoring and tuningBuilding a complex model with many variables, even when the number of rows is small, may be computationally intensive and will therefore benefit from parallel data mining algorithms. In all data mining applications, extra variables can introduce noise and reduce the accuracy of your model. At the same time, you don't want to leave out important variables that can increase your accuracy or simplify the application of the model to new data. Consequently, you often need to do additional analysis and build different models to ensure that you are using the right columns. Furthermore, searching for the best model may require building and testing many different models, sometimes numbering in the hundreds, before the best solution can be found. For example, neural nets are a popular data mining technique that often require building many different models with slightly different architectures in order to find the one with the lowest error rate. Not only must you often build multiple models, but you need to do it with different algorithms. Not all algorithms are suitable for every purpose nor can each find all the patterns for which you are searching. This can further extend the length of time it takes to find the right model. Another problem arises from the fact that the increase in time for some data analytic techniques is worse than linear. For example, the time to complete a simple linear regression increases approximately linearly for the number of observations (i.e., rows), but increases approximately with the square of the number of columns in the model. Thus, fitting six columns will take four times as long as fitting three columns. Similar calculations apply to neural nets, k-nearest neighbor methods, most time series methods and any regression technique. Though all data mining techniques are computer-intensive, some are particularly so. For example, for k-nearest nearest neighbor the calculation time increases as the factorial of the total number of points, and the calculation must be made every time the model is used. If each model you build requires even an hour, let alone a day, you will be unable to test all the variations you would like to. Clearly, if it takes a long time to build each model, or the number of models you must build to find the one you want is large, the only way to effectively mine your data is with a scalable tool. Even when you think you are finished because your model works well, you must continually monitor the performance of the model. Over time, all systems evolve and the data they produce changes. Salespeople know that purchasing patterns change over time. External variables such as inflation rate may change enough to alter the way people behave and the factors that influence that behavior. Even precise manufacturing systems age and cause drift in the data they produce. Thus, from time to time, the model will have to be retrained and possibly completely rebuilt. Control charts of the residual differences between the forecasted values and the observed variable are an excellent way to monitor model results. Control charts are easy to use and understand, not computationally intensive and could be built into the software that implements the model. Thus, the system could monitor itself. For many data mining applications, you are never completely done.
3. Training and validationTraining and validating a model may require large amounts of data manipulation, and up to thousands of training sessions. Data mining error rates must be determined by testing rather than calculation, and all the methods used are computing-intensive. Newer techniques are likely to be even more so. As you will see, in applying these testing methods scalability is a key to efficient model development. A model built by any technique needs to be validated, which means calculating an error rate based on data independent of that used to estimate the model. This exercise gives a statistically valid estimate of the true error rate that the modeling procedure produces. It does not guarantee that the model is correct in any way. It simply says that if the same technique were used on a succession of databases to build a model, the average error rate would be close to the one obtained this way. Because data-mining model building is so complex, you cannot simply calculate the error rate. Instead the model needs to be tested against other data. The testing may reflect the complexity of the model being developed. One of the ironies of data mining is that complex testing methodologies are required when you have only a few cases or samples on which to build the model. As we shall see, complex testing schemes make heavy use of the computer and are one of the reasons scalability is required. The most basic testing method is called simple validation. To carry this out, we set aside a percentage of our database and do not use it in any way in the model building and estimation. This percentage is usually small, perhaps 5% to 20% , but there is no fixed rule. For a very large database, we may set aside 50% or more of the data. For all the future calculations to be correct, the division of the data into two groups must be random. After building and estimating our model on the main body of the data, we use the model to predict the classes of the validation database. We then count up the number of correct and incorrect classifications and calculate an error rate. In building a single model, even this simple validation may need to be performed dozens of times. For example, when using a neural net, each training pass through the net is tested against a test database. Training stops when the error rates on the test database no longer improve with additional iterations. In such a case, being able to perform the training and testing cycle using many CPUs instead of just one will help you arrive at a model sooner. Even small databases present challenges that require major computing power. In a small database, setting aside some data for testing and validation makes the estimation less precise. To address this situation, cross validation was developed. In this method, we randomly divide the data into two equal sets. We estimate on the first set and predict the second. But since the data were randomly divided, we can reverse the situation and estimate on the second database and predict the first. We now have two independent (because we randomly divided) estimates of error rates. We can take the average of the two and get a better estimate of the true error rate. In cross validation we have estimated two models. Which do we use? Neither. Instead we generate a third model using all the data.The error rate for the model using all of the data is the average of the error rates for the two halves. Thus we have a model using all of the data for estimation and all of the data for error calculation. In practice, a database is usually divided into more than just two groups to improve error estimation, and the more general n-fold cross validation is used. In this method, we randomly divide the data into n disjoint groups. For example, suppose we divided the data into ten groups. The first group is set aside for prediction and the other nine are lumped together for estimation. The model built and estimated on the 90% group is then used to predict the group that was set aside. This process is repeated a total of 10 times as each group in turn is set aside and the model built and estimated on the remaining 90% of the data. The model is used to predict the set aside group each time. Finally, we take the mean of the 10 independent error rate predictions as our error rate. Clearly these more complex methods require much more computing. In ordinary cross validation, we have to estimate three models and forecast on two sets of data. The amount of computing required by most models is governed more by the number of variables in the model than by the number of observations. Therefore, the model estimated on each half of the data requires more than half the computing of fitting one model to all of the data. With the fit to all of the data, cross validation requires at least three times the computing as does simple validation. N-fold cross validation, on the other hand, requires at least n times as much computing as simple validation. Since n is usually around 10, this means an order of magnitude increase in computation. The increase in accuracy and precision is usually worth the effort, however.
The next method, bootstrapping, requires even more computing to validate a model than does n-fold cross validation. Briefly, when applying the bootstrapping algorithm you take a random sample, with replacement, of the data of size equal to the whole database. In other words, if you had 1,000 rows of data, you would build a sample of 1,000 rows by picking one row at random. After putting it back, you would picking another row, put it back, and keep on doing this until you had a total of 1,000 rows in your sample. Note that this means that some rows may appear more than once and some may not appear at all. You split the sample into two groups: one for training and one for validation. You repeat this cycle of model building at least 200 times - and in some cases up to 10,000 times. The mean error rate is the mean over all of the repetitions (bootstraps). By building enough models, you will converge on a very low error rate. The bootstrap is extremely powerful as it gives not only the error, but it also gives the variance of the error as the variance over all of the bootstraps. This is an unbiased and usually efficient (lowest possible variance) estimate of the variance of the error rate. Because we are going through the training-test cycle conceivably thousands of times, clearly bootstrapping is computationally intensive. Furthermore, the large number of samples may put an I/O burden on the computer. Only with modern parallel hardware and algorithms can we hope to use bootstrapping effectively.
IV. Making Data Mining Tools Scalable1. Hardware scalabilityWe have talked about the need for parallel hardware for data mining algorithms and database access. As we have seen, many data mining problems involve large, complex databases, complicated modeling techniques and substantial computer processing. In addition to the actual computation, a great deal of processing is needed to select and reformat data for the data mining database. Various selection and subsetting functions are needed, and display functions are used to view data and results. All of these operations take considerable power for databases that store millions or possibly tens of millions of rows. Without adequate processing power, a large or complex data mining effort will come to a halt. The goal of hardware scalability is to provide high performance by adding modestly priced processor building blocks (or nodes) in such a way that performance scales linearly. For example, by doubling the number of processors you can process twice as much data in the same time or the same amount of data in half the time. The hardware vendors have taken two complementary approaches of using multiple CPUs: symmetric multi-processing (SMP) and massively parallel processing (MPP). It is important to understand the different routes to achieving hardware parallelism as they are most frequently used in data mining. A company beginning a data mining effort has to decide whether its computing facilities are adequate, and if not, what steps to take.
In an SMP computer, there is a collection of processing nodes, each with its own private local cache memory. However, there is only a single instance of the operating system, and the disks and main memory are shared through a common bus. The amount of time for any processor to access main memory is equal. A major appeal of SMP is that it is relatively easy to administer, in part because it looks like one computer with a single operating system. It can be used by a DBMS with little if any reprogramming, simply by assigning tasks to CPUs, although modifications such as multithreading based on small grained, light weight threads can let the DBMS take better advantage of SMP. Even greater degrees of scalability can be achieved by moving to one of a variety of MPP architectures. For example, in the so-called "shared nothing" MPP architecture, each node has its own main memory and I/O subsystem, but shares a backplane over which data or requests for services (called functions) can be shipped. Each node in a shared nothing system can be an SMP node, and these can be coupled together to bring hundreds of CPUs to bear on a problem. These are sometimes called SMP clusters, and should not be confused with clustered disk access systems in which a group of CPUs share an I/O system. 2. Parallel algorithmsData mining algorithms written for a uniprocessor machine won't automatically run faster on a parallel machine; they must be rewritten to take advantage of the parallel processors. There are two basic ways of accomplishing this. In the first method, independent pieces of the application are assigned to different processors. The more processors, the more pieces can be executed without reducing throughput. This is called inter-model parallelism. This kind of scale-up is also useful in building multiple independent models. For example, a neural net application could build multiple models using different architectures (e.g., with a different number of nodes or hidden layers) simultaneously on each processor. But what happens if building each model takes a long time? We then need to break this model into tasks, execute those tasks on separate processors, and recombine them for the answer. This second method is called intra-model parallelism. 3. Native database access and parallel DBMSsAs we saw earlier, large data mining problems can require large amounts of database access. This requirement can come from such operations as mining a large amount of data, needing to sample a large database, building many different models, or needing to retrieve many samples when testing and validating the model If the data from an existing database can be mined, then native DBMS access will provide the best possible performance. Most data warehouses today make use of parallel database management systems. However, because the data mining process may involve data from multiple databases, or subsets of data from a data warehouse, it is usually a good idea to create a data mart for purposes of mining that data. This avoids the problem of consolidating data every time you want to use it. While in some cases the data mart may use a special-purpose DBMS developed for the data mining tool, in most cases it will use a standard DBMS. In either case, the size of the data mart or allowing for growth in the size of the data mart will require the data store to be parallelized. A scalable data mining tool should be able to directly access the data mart DBMS, rather than require that the data to be mined has been written to a flat file. Ideally, the tool uses the native SQL of the database server, which maximizes performance and takes advantage of individual server features such as stored procedures and SQL extensions. No single product, however, can support the large variety of database servers, so a gateway must be used for all but the four or five leading DBMSs. The most common gateway supported is Microsoft's ODBC (Open Database Connectivity). A parallel DBMS can assign individual SQL statements to CPUs. This is called inter-query parallelism. Alternatively, it may break up a large SQL statement and assign pieces of it to different CPUs. This is called intra-query parallelism. Both of these approaches are analogous to how data mining models are parallelized. If there is more than one data stream, it's possible for some operations to proceed simultaneously. For example, a customer database being mined could be spread across multiple disks, and a thread could process each subset of the customer data. This is called partitioned parallelism. Because data can be partitioned into many subsets, this technique provides a much greater opportunity for performance improvement. All of these types of parallel processing can happen in either SMP or shared-nothing MPP architectures. The algorithms for good SMP performance, however, are different than those for good shared nothing performance. A good partitioning scheme is an essential part of designing a database that will benefit from parallelism. Partitioned data benefits from indexed access just like non-parallel databases do. The indexes may also be partitioned. To get the best results of partitioning, the allocation scheme must be carefully designed to minimize skew and data movement. The choices will largely be determined by the queries run against the database, and the inevitable tradeoffs will mean some queries will run faster than others. In particular, any partitioning scheme can run afoul of unanticipated queries, and the parallel optimizer needs to be able to deal with these variations. A good parallel optimizer has a host of techniques to minimize the need for data shipping, but, even in a shared nothing architecture, sometimes considerable amounts of data will need to be shipped. There are so many variables among the hardware configurations and data partitioning that it is impossible to draw conclusions from vendor benchmarks. V. ConclusionData mining is an important method for extracting valuable information from all sizes of databases: large and small. Data miners need fast response time when building their models so they can construct the most effective model. Three factors, however, make developing a data mining model a potentially lengthy process:
The solution to these large-scale problems is through applying parallel technology. Scalable data mining tools take advantage of hardware parallelism, including parallel algorithms as well as direct access to parallel DBMSs. Scalable Data Mining RequirementsThe following is a list of items to consider when evaluating the need for scalable data mining solutions.
Scalable Data Mining ToolsThe following list summarizes those features of a data mining tool that make it scalable.
|
| Home | | | About Data Mining | | | Publications | | | Seminars | | | Consulting | | | About Two Crows |