In a previous post, we introduced the data cleaning topic. We defined data cleaning as the ‘process of transforming raw data and make it sufficiently standard to be analyzed’. We also said that there are three types of problems in data cleaning: general, missing data, and outliers.
This post focuses on general data cleaning problems. These problems can be:
- Data types.
- Data standardization.
- Constant features.
- Duplicated rows.
- Duplicated features.
- Values out of range.
- Shuffle dataset.
In the following sections, we will see examples of each of these problems, as well as ways to work around them. All the examples are accessible on GitHub.
Data types
A data type tells Python how you want to use the data. For example, if you say that you want to use the data as a string, Python will not allow you to perform mathematical calculations on it as if it were a number. That’s what you get by defining data types.
There are three main types of data:
- Numeric. Numeric data refers to integers (e.g. 10) and floats (e.g. 10.1).
- Character. Character data refers to strings, which can contain numbers and/or characters (e.g. ‘ab32’).
- Date/time. Date/time refers to data holding information about time (e.g. 2018-01-05 20:23:02). Note that date/time data is not a conventional type of data, but it is common in data science problems.
An alternative way to classify data types is to distinguish between:
- Continuous. We say that a feature is continuous if, between any two values, it has an infinite number of values. The height of a person is an example of a continuous feature. Continuous features can be numeric or date/time.
- Categorical. Categorical features are those that can only take a limited, and usually fixed, number of values. For example, the gender of a person is a categorical feature (Male or Female). A categorical feature is said to be ordinal if the order of its values matters. Ticket class is an example of an ordinal feature because there’s an order associated with its possible values (1st class > 2nd class > 3rd class). Categorical features can be numeric or character.
Table 1 shows how pandas names these data types.
|Alternative type|Type|Pandas type|
Table 1. Data types.
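As a rough illustration of how pandas names these types, the sketch below builds a small DataFrame and collects the dtype of each column. The column names and values are invented for this example, and the exact dtype strings can vary slightly between pandas versions (text columns in particular).

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration.
df = pd.DataFrame({
    "age": [22, 35, 58],                # numeric, integer
    "height": [1.70, 1.82, 1.65],       # numeric, float (continuous)
    "name": ["Ana", "Rui", "Eva"],      # character
    "signup": pd.to_datetime([          # date/time
        "2018-01-05 20:23:02",
        "2018-02-10 08:00:00",
        "2018-03-01 12:30:00",
    ]),
    "ticket_class": pd.Categorical(     # ordinal categorical
        ["3rd", "1st", "2nd"],
        categories=["3rd", "2nd", "1st"],  # listed from lowest to highest
        ordered=True,
    ),
})

# Map each column to the name pandas gives its type.
# Text columns show up as "object" in older pandas versions
# and as a dedicated string dtype in newer ones.
dtype_names = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtype_names)

# Because the category is ordered, comparisons such as max() respect the order.
print(df["ticket_class"].max())
```

Declaring the order explicitly is what lets pandas treat the ticket class as ordinal rather than as a set of unrelated labels.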
Fixing the data type of each feature is one of the first steps in data cleaning. Only after this can you proceed to other tasks, such as deduplication (e.g. character data) or encoding (e.g. categorical data).
In the following notebook, we can learn how to check the format of our data and how to change it.
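A minimal sketch of what checking and changing types can look like in pandas, assuming a hypothetical dataset in which every column was read in as text. The column names and the choice to coerce unparseable values to missing are assumptions made for this example.

```python
import pandas as pd

# Hypothetical raw data where everything arrived as text.
raw = pd.DataFrame({
    "price": ["10.1", "7", "n/a"],
    "sold_at": ["2018-01-05 20:23:02", "2018-01-06 09:15:00", "2018-01-07 18:40:10"],
    "size": ["S", "M", "S"],
})

# Check the current type of every column.
print(raw.dtypes)

clean = raw.copy()
# errors="coerce" turns entries that cannot be parsed (here "n/a")
# into missing values instead of raising an error.
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")
clean["sold_at"] = pd.to_datetime(clean["sold_at"])
clean["size"] = clean["size"].astype("category")

# Check the types again after the conversion.
print(clean.dtypes)
```

Coercing silently hides bad entries, so it is worth counting the missing values afterwards (for example with `clean["price"].isna().sum()`) to see how many were affected.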
Constant features
Constant features are those that always have the same value. A constant feature should be deleted because it does not provide any valuable information to your model.
See how to do it in the following example.
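One possible way to find and drop constant features with pandas is sketched below. The DataFrame is made up for illustration, and counting missing values as a value of their own is a choice made for this example.

```python
import pandas as pd

# Hypothetical toy data; "country" never changes.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["PT", "PT", "PT", "PT"],
    "score": [0.3, 0.8, 0.5, 0.8],
})

# A column is constant if it holds a single distinct value.
# dropna=False counts NaN as a value of its own, so a column that mixes
# one value with missing entries is not treated as constant.
n_unique = df.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)

df_reduced = df.drop(columns=constant_cols)
print(df_reduced.columns.tolist())
```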
Duplicated rows
We say that we have duplicated rows when two or more rows repeat the exact same values. When that happens, we want to keep one of those rows and delete the others. That’s what we will do in the following example.
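A short sketch of row deduplication in pandas, using invented data. Keeping the first occurrence is an assumption here; `keep="last"` would work just as well if the later row is preferred.

```python
import pandas as pd

# Hypothetical toy data; the third row repeats the first one.
df = pd.DataFrame({
    "name": ["Ana", "Rui", "Ana", "Eva"],
    "age": [22, 35, 22, 58],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped)
```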
Duplicated features
We have duplicated features when two or more features of our dataset have the exact same values in all observations. In these cases, we should keep only one of those features and delete the remaining ones. See how in the example.
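One simple way to detect duplicated features, sketched under the assumption that the dataset is small enough to transpose comfortably. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; "height_copy" repeats "height_cm" under another name.
df = pd.DataFrame({
    "height_cm": [170, 182, 165],
    "height_copy": [170, 182, 165],
    "weight_kg": [65, 80, 58],
})

# duplicated() compares rows, so we transpose to compare columns instead.
# Transposing can be slow and memory-hungry on very wide or very long data.
duplicate_mask = df.T.duplicated(keep="first")
duplicate_cols = duplicate_mask[duplicate_mask].index.tolist()
print(duplicate_cols)

df_reduced = df.drop(columns=duplicate_cols)
print(df_reduced.columns.tolist())
```

This only catches columns whose values match exactly; features that are merely highly correlated need a different check.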
Values out of range
In this context, values out of range will be all the values that are obviously wrong. For example, we do not expect an age variable to be negative, or to have a Benfica supporter that is not a champion. Cleaning values out of range should be done before starting any kind of statistical analysis. Learn how in the following example.
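The age case can be sketched as follows. The bounds `AGE_MIN` and `AGE_MAX` are assumptions chosen for this example, and whether to drop the rows or mark the values as missing depends on the problem.

```python
import pandas as pd

# Hypothetical toy data with two implausible ages.
df = pd.DataFrame({
    "name": ["Ana", "Rui", "Eva", "Joao"],
    "age": [22, -3, 58, 240],
})

# Assumed plausible bounds for a human age; adjust to the problem at hand.
AGE_MIN, AGE_MAX = 0, 120

# between() is inclusive on both ends by default.
in_range = df["age"].between(AGE_MIN, AGE_MAX)
print(df[~in_range])  # inspect the suspicious rows before acting on them

# Option 1: drop the offending rows.
dropped = df[in_range].copy()

# Option 2: keep the rows but turn the out-of-range values into missing values.
masked = df.copy()
masked["age"] = masked["age"].where(in_range)
print(masked)
```

Option 2 hands the problem over to whatever missing-data strategy is used later, which keeps the other columns of those rows available.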
Shuffle dataset
We shuffle our dataset to avoid bias caused by patterns in the order of the rows (for example, data sorted by date or by class). If we follow proper modeling procedures, this should not be an issue. However, it’s good to be aware of it and make sure that our dataset is shuffled. The following example shows how things should be done.
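A minimal sketch of shuffling a DataFrame with pandas. The data is invented, and the fixed `random_state` is there only to make the example reproducible.

```python
import pandas as pd

# Hypothetical toy data sorted by label, which is the kind of pattern we want to break.
df = pd.DataFrame({
    "x": range(10),
    "label": [0] * 5 + [1] * 5,
})

# frac=1 samples every row without replacement, i.e. a random permutation.
# reset_index(drop=True) discards the old row order stored in the index.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```

Shuffling like this is not appropriate when the order itself carries information, as in time series, where rows should usually stay in chronological order.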