Doing Data Cleaning Like The Pros

In a previous post, we introduced the topic of data cleaning. We defined data cleaning as the ‘process of transforming raw data to make it sufficiently standard to be analyzed’. We also said that there are three types of problems in data cleaning: general problems, missing data, and outliers.

This post focuses on general data cleaning problems. These problems can be:

  1. Data types.
  2. Data standardization.
  3. Constant features.
  4. Duplicated rows.
  5. Duplicated features.
  6. Values out of range.
  7. Shuffle dataset.

In the following sections, we will see examples of each of these problems, as well as ways to work around them. All the examples are accessible on GitHub.

If you want to know more about data cleaning, you can also study one of these books: Python Data Analysis or Machine Learning with Python Cookbook.

Data types

A data type tells Python how you want to use the data. For example, if you declare the data as a string, Python will not let you perform mathematical calculations on it. That’s what you get by defining data types.

There are three main types of data:

  • Numeric. Numeric data refers to integers (e.g. 10) and floats (e.g. 10.1).
  • Character. Character data refers to strings, which can contain numbers and/or characters (e.g. ‘ab32’).
  • Date/time. Date/time refers to data holding information about time (e.g. 2018-01-05 20:23:02). Note that date/time data is not a conventional type of data, but it is common in data science problems.

An alternative way to classify data types is to distinguish between: 

  • Continuous. We say that a feature is continuous if, between any two values, it has an infinite number of values. The height of a person is an example of a continuous feature. Continuous features can be numeric or date/time.
  • Categorical. On the other hand, categorical features are those that can only take a limited – and usually fixed – number of values. For example, the gender of a person is a categorical feature (Male or Female). A categorical feature is said to be ordinal if its order matters. Ticket class is an example of an ordinal feature because there’s an order associated with its possible values (1st class > 2nd class > 3rd class). Categorical features can be numeric or character.
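To make the ordinal case concrete, here is a minimal sketch using pandas. The ticket-class values and the variable names are made up for illustration; the point is only that an ordered categorical type lets pandas respect the order between classes.

```python
import pandas as pd

# Hypothetical ticket-class column, stored as plain text.
tickets = pd.Series(["3rd", "1st", "2nd", "3rd"])

# Declare the allowed categories, listed from lowest to highest class.
class_order = pd.CategoricalDtype(categories=["3rd", "2nd", "1st"], ordered=True)
tickets_ordinal = tickets.astype(class_order)

# Ordered categoricals support comparisons and min/max.
best_class = tickets_ordinal.max()
is_above_third = tickets_ordinal > "3rd"
```

Without `ordered=True`, the same column would still be categorical, but comparisons such as `>` would raise an error because no order is defined.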

Table 1 shows how pandas names these data types.

Alternative type   Type        Pandas type
Continuous         Numeric     int64, float64
                   Date/time   datetime64[ns], timedelta64[ns]
Categorical        Numeric     int64, float64
                   Character   object

Table 1. Data types.

Fixing the data type of each feature is one of the first steps in data cleaning. Only after this can you proceed to other tasks, such as deduplication (e.g. of character data) or encoding (e.g. of categorical data).

In the following notebook, we can learn how to check the data types of our features and how to change them.
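As a rough sketch of what checking and changing data types can look like in pandas, consider the small, made-up dataset below, where every column was read as text. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw data: every value arrives as text.
df = pd.DataFrame({
    "age": ["23", "35", "41"],
    "height": ["1.70", "1.82", "1.65"],
    "signup": ["2018-01-05 20:23:02", "2018-02-11 08:00:00", "2018-03-20 13:45:10"],
})

# Check the current type of each column.
dtypes_before = df.dtypes

# Convert each column to the type that matches how we want to use it.
df["age"] = df["age"].astype("int64")
df["height"] = df["height"].astype("float64")
df["signup"] = pd.to_datetime(df["signup"])

dtypes_after = df.dtypes
```

After the conversion, arithmetic such as `df["age"].mean()` works as expected, and date attributes such as `df["signup"].dt.year` become available.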

Data standardization

Constant features

Constant features are those features that always have the same value. If a feature has a constant value, it should be deleted because it does not provide any valuable information to your model.

See how to do it in the following example.
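One possible way to find and drop constant features with pandas is sketched below. The dataset is invented, and counting distinct values with `nunique` is just one of several reasonable approaches.

```python
import pandas as pd

# Hypothetical dataset in which "country" never changes.
df = pd.DataFrame({
    "country": ["PT", "PT", "PT", "PT"],
    "age": [23, 35, 41, 35],
    "score": [0.1, 0.5, 0.5, 0.9],
})

# A column is constant when it holds a single distinct value.
# dropna=False counts NaN as a value of its own, so a column that
# mixes one value with missing entries is not flagged as constant.
n_unique = df.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

df_clean = df.drop(columns=constant_cols)
```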

Duplicated rows

We have duplicated rows when two or more rows hold exactly the same values. When that happens, we want to keep one of those rows and delete the rest. That’s what we will do in the following example.
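A minimal sketch of this with pandas, on a made-up dataset, could look as follows. Keeping the first occurrence is an assumption; `keep="last"` would work just as well.

```python
import pandas as pd

# Hypothetical dataset where the third row repeats the first one.
df = pd.DataFrame({
    "name": ["Ana", "Rui", "Ana", "Eva"],
    "age": [23, 35, 23, 41],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and rebuild the index.
df_clean = df.drop_duplicates(keep="first").reset_index(drop=True)
```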

Duplicated features

We have duplicated features when two or more features of our dataset have the exact same value in all observations. In these cases, we should keep only one of those features and delete the remaining ones. See how in the example.
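One way to sketch this in pandas is to transpose the dataset, so that features become rows, and reuse the duplicated-row check. The column names below are hypothetical, and transposing can be slow or memory-hungry on large datasets, so treat this as an illustration rather than the only approach.

```python
import pandas as pd

# Hypothetical dataset where "age_copy" repeats "age" in every row.
df = pd.DataFrame({
    "age": [23, 35, 41],
    "age_copy": [23, 35, 41],
    "height": [1.70, 1.82, 1.65],
})

# After transposing, each feature is a row, so duplicated() flags
# features whose values repeat an earlier feature.
is_duplicate = df.T.duplicated(keep="first")
duplicate_cols = is_duplicate[is_duplicate].index.tolist()

df_clean = df.drop(columns=duplicate_cols)
```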

Values out of range

In this context, values out of range are all the values that are obviously wrong. For example, we do not expect an age variable to be negative, or to have a Benfica supporter that is not a champion. Cleaning values out of range should be done before starting any kind of statistical analysis. Learn how in the following example.
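The sketch below shows two common ways of handling out-of-range values in pandas: dropping the rows or marking the values as missing. The data is invented, and the bounds of 0 and 120 for age are an assumption chosen for illustration; the right limits depend on your data.

```python
import pandas as pd

# Hypothetical ages, two of which are obviously wrong.
df = pd.DataFrame({"age": [23, -4, 41, 230]})

# Assumed valid range for age (both bounds inclusive).
valid = df["age"].between(0, 120)
n_invalid = int((~valid).sum())

# Option A: drop the rows with out-of-range values.
df_dropped = df[valid].reset_index(drop=True)

# Option B: keep the rows but replace the bad values with NaN,
# so they can be treated later as missing data.
df_masked = df.copy()
df_masked["age"] = df_masked["age"].where(valid)
```

Option B turns the problem into a missing-data problem, which is useful when the other columns of those rows are still trustworthy.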

Shuffle dataset

We shuffle our dataset so that any pattern in the order of the rows (for example, rows sorted by date or by class) does not bias our analysis or model. If we follow proper modeling procedures, this should not be an issue. However, it’s good to be aware of it and make sure that our dataset is shuffled. The following example shows how things should be done.
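A minimal sketch of shuffling with pandas is shown below, on a made-up dataset sorted by label. The `random_state` value is arbitrary and is set only to make the shuffle reproducible.

```python
import pandas as pd

# Hypothetical dataset sorted by label, which is the kind of
# ordering pattern we want to break.
df = pd.DataFrame({
    "id": range(6),
    "label": [0, 0, 0, 1, 1, 1],
})

# Sampling 100% of the rows without replacement returns them in
# random order; reset_index discards the old row positions.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
```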

