Established 2009

Data Preparation

Once the quality assessment step is complete, it is time to prepare your data for analysis. The data preparation step of the CRISP-DM methodology mainly involves cleaning up your data so that it is ready for analysis. This step can involve one or more sub-steps, including data cleansing, data transformation, data imputation, data filtering, and data reduction (for example, through data sampling, dimensionality reduction, or data discretization).

Data Cleansing

Clean data is usable data. The reality in business is that very few, if any, data sets are completely clean and ready for analysis. Nearly all databases contain some errors. The data cleansing step involves, for example:

  • Removing erroneous data, such as incorrect or mistakenly-entered characters
  • Aligning data within the proper fields
  • Correcting data type errors
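
As a rough illustration of the cleansing tasks listed above, the following Python sketch cleans a handful of made-up records. The field names (`name`, `age`), the sample rows, and the `clean_record` helper are all hypothetical, and the rule of keeping only the digits in a numeric field is an assumption that would need to be checked against your own data.

```python
def clean_record(raw):
    """Return a cleaned copy of one raw record (a dict of strings)."""
    # Remove stray characters that were entered into a numeric field by mistake.
    age_digits = "".join(ch for ch in raw["age"] if ch.isdigit())
    return {
        # Trim surrounding whitespace and normalize capitalization of the name.
        "name": raw["name"].strip().title(),
        # Correct the data type: store age as an integer,
        # or None when no usable digits remain.
        "age": int(age_digits) if age_digits else None,
    }


# Hypothetical raw rows, as they might arrive from a data-entry system.
raw_rows = [
    {"name": "  alice smith ", "age": "34"},
    {"name": "BOB JONES", "age": "42#"},    # stray "#" typed after the number
    {"name": "carol diaz", "age": "n/a"},   # no usable value at all
]

cleaned_rows = [clean_record(row) for row in raw_rows]
```

Values that cannot be repaired are stored as `None` rather than guessed, which leaves the decision about how to handle them to a later step.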

Data Transformation

This optional step involves altering data by, for example, multiplying values by a coefficient to bring them within a certain range, or converting values to a logarithmic scale or an index score.
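
Two of the transformations mentioned here, rescaling to a range and converting to a logarithmic scale, can be sketched in a few lines of Python. The function names and the `revenue` figures are illustrative assumptions, and min-max scaling is only one of several ways to bring values into a range.

```python
import math


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall between new_min and new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def log_transform(values):
    """Convert strictly positive values to their base-10 logarithms."""
    return [math.log10(v) for v in values]


# Hypothetical revenue figures spanning several orders of magnitude.
revenue = [10.0, 100.0, 1000.0]

scaled = min_max_scale(revenue)   # smallest value maps to 0.0, largest to 1.0
logged = log_transform(revenue)   # compresses the spread between values
```

The logarithmic version is only defined for positive values, so zeros or negatives would need separate handling.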

Data Imputation

Almost all large data sets have at least some missing values. Data imputation involves filling in values in cases where a "dummy" value would serve better than eliminating the entire record.
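
A minimal sketch of imputation, assuming that missing values are represented as `None` and that the median of the observed values is an acceptable stand-in. Both are assumptions made for illustration; the `incomes` list and the `impute_median` helper are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical income column with two missing entries.
incomes = [52000, None, 61000, 48000, None]

imputed = impute_median(incomes)
```

Every record is kept, and the filled-in values sit at the center of the observed data so they do not pull the distribution toward either extreme.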

Data Filtering

This step involves removing outliers (i.e., values that are implausibly large or small in the context of the data set) or eliminating other data that might distort the results of the analysis.
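
One common way to flag outliers is the interquartile range (IQR) rule, sketched below in Python. The rule itself, the multiplier `k=1.5`, and the sample `order_totals` are assumptions chosen for illustration; what counts as "too large" or "too small" ultimately depends on the context of the data.

```python
from statistics import quantiles


def filter_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical order totals with one value far outside the usual range.
order_totals = [20, 22, 19, 21, 23, 20, 500]

kept = filter_outliers_iqr(order_totals)
```

It is usually worth inspecting the removed values before discarding them, since an extreme value can be a data-entry error or a genuine but rare event.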

Data Reduction

It is not always the case that more data is better. Sometimes, it is desirable or necessary to reduce the sheer volume of data being manipulated for analysis. In these cases, a data reduction technique can be employed to reduce the overall size of the data set. Some common techniques include data sampling, dimensionality reduction and data discretization.

Data Sampling

This involves taking a random sample of a very large data set that is large enough to be statistically representative. The goal is to reduce the number of records for analysis while keeping the results within very stringent tolerance ranges of what the full data set would produce.
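
Drawing a simple random sample without replacement might look like the following Python sketch. The population of record IDs, the sample size, and the `draw_sample` helper are hypothetical; choosing a sample size that meets a given tolerance is a separate statistical question that this sketch does not address.

```python
import random


def draw_sample(records, sample_size, seed=None):
    """Draw a simple random sample without replacement."""
    # A dedicated generator keeps the draw reproducible for a given seed.
    rng = random.Random(seed)
    return rng.sample(records, sample_size)


# Hypothetical population of 100,000 record IDs.
population = list(range(100_000))

sample = draw_sample(population, sample_size=1_000, seed=42)
```

Fixing the seed means the same records are selected on every run, which makes an analysis based on the sample repeatable.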

Dimensionality Reduction

Some data sets contain variables that are not likely to add to the value of the outcome of an analysis. The data fields containing these variables can be removed from the data set before an analysis is executed.
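
How to decide which fields add no value is not specified here. As one simple, assumed criterion, the Python sketch below drops any field whose value is identical in every record, since such a field cannot help distinguish one record from another. The `customers` rows and the `drop_constant_fields` helper are hypothetical.

```python
def drop_constant_fields(rows):
    """Remove fields whose value is the same in every record."""
    if not rows:
        return []
    # Keep a field only if it takes more than one distinct value.
    keep = [
        field for field in rows[0]
        if len({row[field] for row in rows}) > 1
    ]
    return [{field: row[field] for field in keep} for row in rows]


# Hypothetical customer records; "country" never varies.
customers = [
    {"country": "US", "age": 34, "spend": 120.0},
    {"country": "US", "age": 51, "spend": 80.5},
    {"country": "US", "age": 27, "spend": 240.0},
]

reduced = drop_constant_fields(customers)
```

In practice, the choice of fields to remove often also draws on domain knowledge or on measures such as correlation with the outcome being studied.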

Data Discretization

Some data fields can contain a very large range of continuous values, with nearly every possible value along the continuum included. However, some data mining algorithms are not set up to handle continuous variables. In these cases, the data can be discretized, which means putting the continuous values into discrete buckets or categories. In this way, the algorithm in question only has to handle 5 or 10 possible discrete values rather than hundreds or thousands of continuous ones.
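
Equal-width binning is one straightforward way to discretize a continuous field, sketched below in Python. The choice of equal-width buckets, the bucket count of 5, and the `ages` values are assumptions for illustration; other schemes, such as equal-frequency buckets, may suit a given data set better.

```python
def discretize_equal_width(values, n_bins=5):
    """Assign each value to one of n_bins equal-width buckets (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread in the data: everything falls into the first bucket.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land in bucket n_bins,
    # so clamp it into the last valid bucket.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical ages to be grouped into five buckets.
ages = [18, 25, 33, 47, 52, 68, 90]

buckets = discretize_equal_width(ages, n_bins=5)
```

Each age is replaced by a bucket number from 0 to 4, so an algorithm that expects categories sees five possible values instead of a continuous range.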

Typically, the Data Discovery phase is followed by our Predictive Modeling phase. We then implement our findings in the form of a marketing campaign via our full set of Marketing Services.

Contact MindEcology today to get started.