Is Your “Dirty” Data Typical?
At MindEcology, when we engage a new client for an analysis, we always start with a look at their database in its raw state. Depending upon the project, the data could consist of customer names, mailing lists, order history, or product catalogs, for example.
It is the rule, rather than the exception, that most customer databases that our clients bring to us have not been well-maintained. Whether the data we receive has been stored in a POS (point of sale) system, an online shopping cart application, an MS Excel file, or an MS Access or SQL database – a large portion of most companies’ data is “dirty” data.
Dirty data refers to a large data set which contains a possible wide assortment of imperfections, including missing data values, incorrect value types (e.g., text instead of numeric data for a given field), values that are out of the possible range, incomplete records, etc.
In fact, many of our MindEcology clients are almost apologetic when they grant us accesss to their database for the first time. It is almost as if they have invited us over for a seven-course meal but want us to “please excuse our messy house – it’s been a crazy week.”
Given that this is such a common experience for us, we are never surprised when a new client’s database is found to have not been maintained to ideal standards. Some examples of all-too-common database problems include:
* in an address file (or table): address data showing up in the phone number field
* in an order history file: existence of blank (zero-value) revenue fields for some orders
* in a customer name file: first and last names are transposed for some records
* in a product list file: missing product IDs for products that show up in the order history file (etc.)
Fortunately, in mosts cases this is a total non-issue once we get started on a project. That is because, before we actually start analyzing a database for any project, we carry out what’s called “data cleansing” and, when necessary, data imputation.
Data cleansing involves:
1. validating codes (data values) against a list of acceptable values and deleting or fixing each one as necessary 2. deleting very “dirty” records 3. de-duplicating and merging records
Meanwhile, data imputation involves filling in missing values with intuitive data, such as with a reasonable estimate (this is better than leaving the item blank). There are a number of techniques for doing this.
These techniques are industry-standard in the data mining and modeling fields and help ensure that we have good, clean, well-organized data as we begin the analysis phase of a new project.