What is Data Cleaning and How Does it Work?

Another morning seated in front of the computer. The clean desk and the minimalist style that decorate my dining room hide a secret: the messiness of my dataset.

After several hours of unfair struggle against an emotionless piece of code that refuses to obey me, I finally manage to break it and bend it to my will. The data cleaning process is finished and the dataset is ready for analysis.

The sun starts peeking through the window, reminding me that it’s time to go. Alexa’s sexy voice confirms it. It’s 6:30 AM and I have to leave my la la land and go to work.

Data cleaning is a process

Data cleaning is the process of transforming raw data and making it sufficiently standard to be analyzed. This modification process aims to turn a dataset that has discrepancies and errors into a consistent piece of data that you can manipulate and torture until it confesses to anything. From a certain perspective, data cleaning is like making your house ready for guests. You clean, you change, you adjust… You do whatever is necessary to be guest ready.

[Image: cleaning the house. (Data) Cleaning is Fun!]

Data cleaning is a major concern

Data cleaning plays an important role in any data science project because the accuracy of the results depends on the data we use. If you input quality, you output quality. If you input trash, you output trash. It’s that simple.

No matter how well your data collection process is designed and implemented, you’ll have to deal with faulty raw data. Some common situations that you’ll encounter when working with real-life datasets are:

    • Data in different units/formats because it comes from different sources. For example, one source may record a house’s area in square meters, while another records it in square feet.
    • Data with wrong data types. Instead of being stored as numbers, numeric values are stored as strings.
    • Data with wrong category labels. Most probably, your computer thinks that ‘portugal’ and ‘Portugal’ belong to different categories. You have to convince it otherwise. (A small code sketch illustrating these three situations follows this list.)
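
To make these situations concrete, here is a minimal pandas sketch. The DataFrame and its column names (area_sqft, country) are hypothetical, invented for illustration, and each fix shown is one possible approach rather than the only one.

    import pandas as pd

    # Hypothetical raw data showing the three situations above
    df = pd.DataFrame({
        "area_sqft": ["1076", "1292", "861"],          # numbers stored as strings
        "country": ["portugal", "Portugal", "Spain"],  # inconsistent labels
    })

    # 1. Different units: convert square feet to square meters
    #    (1 square foot = 0.092903 square meters)
    df["area_sqm"] = pd.to_numeric(df["area_sqft"]) * 0.092903

    # 2. Wrong data types: pd.to_numeric above already turned the
    #    string column into proper numbers before the conversion

    # 3. Wrong category labels: normalize casing so 'portugal' and
    #    'Portugal' end up in the same category
    df["country"] = df["country"].str.capitalize()

    print(df.dtypes)
    print(df)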

Data cleaning lives between data understanding and modeling

In their analysis of the data mining process, Wirth and Hipp (2000) propose a six-phase model that encapsulates the life cycle of a data mining project. This model is named CRISP-DM (CRoss Industry Standard Process for Data Mining) and is shown in Figure 1.

Figure 1. Phases of the CRISP-DM Process Model for Data Mining [1]

Setting aside some nuances of terminology and small details, we can use CRISP-DM to organize our thoughts on the data science process. If we do so, data cleaning fits into the ‘Data Preparation’ box. This means that we perform data cleaning after understanding our data (e.g. knowing about data sources, goals, and processes) and before we start modeling the problem. Just like knowing more about your guests before hosting them.

Data cleaning is a unique iterative process

Data cleaning involves exploration and interaction. You’ll often detect data discrepancies/errors during data preparation activities other than data cleaning (e.g. exploratory data analysis or feature engineering), which makes data cleaning an iterative process.

It is also a process that varies from dataset to dataset. Every project has different business needs, different data collection methods, and different types of data, which turn each data cleaning process into a unique piece of work. There’s no single guide that covers all the data cleaning steps we might run into.

Nonetheless, you can approach data cleaning in a structured way. There are several steps common to most data cleaning work. By planning them in advance, you can actively search for discrepancies and errors in your raw data and make the data cleaning process more efficient.

Data cleaning can be approached in a structured way

The goal of this series of blog posts about data cleaning is to provide you with a framework that you can use in all your data cleaning analyses. This framework divides data cleaning problems into three types (a short sketch of some of the ‘General’ checks follows the list):

  1. General
    1. Data types
    2. Data standardization
    3. Constant features
    4. Duplicated rows
    5. Duplicated features
    6. Values out of range
    7. Shuffle dataset
  2. Missing data
    1. Basic strategies
    2. Advanced strategies
  3. Outliers
    1. Statistical approaches
    2. Unsupervised approaches
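
To give you a taste of the ‘General’ problems, here is a minimal pandas sketch of three of the checks listed above: duplicated rows, constant features, and values out of range. The DataFrame and its columns are hypothetical, and the checks shown are one reasonable way to run each step; the dedicated posts will cover each problem properly.

    import pandas as pd

    # Hypothetical dataset with a few 'General' problems planted in it
    df = pd.DataFrame({
        "id": [1, 2, 2, 4],
        "constant": [0, 0, 0, 0],   # a feature that never changes
        "age": [25, 40, 40, 150],   # 150 is outside the plausible range
    })

    # Duplicated rows: flag exact duplicates for inspection or removal
    duplicated_rows = df[df.duplicated()]

    # Constant features: columns with a single value carry no information
    constant_features = [col for col in df.columns if df[col].nunique() == 1]

    # Values out of range: compare against domain knowledge (human age here)
    out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]

    print(duplicated_rows)
    print(constant_features)
    print(out_of_range)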

For the following weeks, I’ll be writing about each type of problem. I’ll follow a problem-solution presentation style: for each problem, I’ll present an example and a possible solution. References to relevant literature and Python packages will be provided throughout.

References

[1] Wirth, R. and Hipp, J., 2000, April. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining (pp. 29-39). Citeseer.
