Data Cleaning

Removing duplicates

This is a data cleaning tutorial.

Data cleaning is a crucial step in the data analysis process to ensure accuracy and reliability in your results. Pandas provides several techniques for cleaning data. In this tutorial, we explore two of them: removing duplicates and data transformation (data type conversion).

To deal with duplicated values, Pandas offers the method “duplicated()”, which identifies duplicate rows within a dataframe. Once the duplicates are identified, we can drop (remove) them with the method “drop_duplicates()”. Let’s see how both methods work.

First, we create a sample dataframe (including duplicates).

import pandas as pd

# Sample DataFrame (with duplicated values)
data = {'Name': ['Ben', 'Samantha', 'Ben', 'Aleksandra'],
        'Age': [22, 38, 22, 18]}

df = pd.DataFrame(data)
print(df)

         Name  Age
0         Ben   22
1    Samantha   38
2         Ben   22
3  Aleksandra   18

Now, apply “duplicated()”. It returns a boolean Series in which True marks a row that repeats an earlier row.

duplicates = df.duplicated()
print(duplicates)

0    False
1    False
2     True
3    False
dtype: bool
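
The boolean result can also be used to count or display the duplicated rows. The following is a minimal sketch using the same sample data; the variable names “n_duplicates” and “all_copies” are chosen here for illustration only.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'Name': ['Ben', 'Samantha', 'Ben', 'Aleksandra'],
                   'Age': [22, 38, 22, 18]})

# Count the rows flagged as duplicates (True counts as 1)
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# keep=False flags every row of a duplicated group, including the first one
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```

With “keep=False”, both copies of a repeated row are selected, which helps when inspecting duplicates before deciding which ones to drop.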

Let’s remove all duplicates. By default, “drop_duplicates()” compares all columns and keeps the first occurrence of each duplicated row.

no_duplicates = df.drop_duplicates()
print(no_duplicates)

         Name  Age
0         Ben   22
1    Samantha   38
3  Aleksandra   18

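The third row (index 2) has been deleted. To compare only certain columns, use the “subset” argument. For example, “no_duplicates = df.drop_duplicates(subset=['Name', 'Age'])”. Since this dataframe contains only these two columns, the result is the same as above.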
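
The effect of “subset” is easier to see when rows match in one column only. The sketch below uses a hypothetical variation of the sample data (the second “Ben” has a different age) together with the “keep” argument of “drop_duplicates()”. The dataframe name “people” and the changed age are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: the name 'Ben' repeats, but the ages differ
people = pd.DataFrame({'Name': ['Ben', 'Samantha', 'Ben', 'Aleksandra'],
                       'Age': [22, 38, 23, 18]})

# Comparing all columns: no row is an exact copy, so nothing is removed
full = people.drop_duplicates()

# Comparing 'Name' only: the first 'Ben' is kept (keep='first' is the default)
first_by_name = people.drop_duplicates(subset=['Name'])

# Comparing 'Name' only, but keeping the last occurrence instead
last_by_name = people.drop_duplicates(subset=['Name'], keep='last')

print(first_by_name)
print(last_by_name)
```

Which occurrence to keep depends on the data; for example, “keep='last'” can be a reasonable choice when later rows hold more recent values.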

Changing data types

Changing data types, also called type conversion, means converting the data type of one or more dataframe columns, for instance converting an age column from integers to floats. Pandas provides a convenient method for this, “astype()”.

Keep in mind that “astype()” raises an error if a value cannot be converted (for example, the string 'abc' to an integer), so it’s important to handle such cases.
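
One possible way to handle values that cannot be converted is sketched below. It uses “pd.to_numeric()” with “errors='coerce'”, which is a different function from “astype()” and replaces unconvertible values with NaN instead of raising an error. The sample Series “raw” is hypothetical.

```python
import pandas as pd

# Hypothetical column containing one value that is not a number
raw = pd.Series(['1', '2', 'three', '4'])

# astype(int) stops with a ValueError at the first value it cannot convert
failed = False
try:
    raw.astype(int)
except ValueError as exc:
    failed = True
    print('astype failed:', exc)

# to_numeric with errors='coerce' turns unconvertible values into NaN
cleaned = pd.to_numeric(raw, errors='coerce')
print(cleaned)
```

Because NaN is a floating-point value, the resulting column is of type float64. The missing entries can then be inspected, filled, or dropped as a separate step.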

First, let’s create a sample dataframe and check the data type of all columns with the “dtypes” attribute.

import pandas as pd

# Sample DataFrame
data = {'X': ['1', '2', '3', '4', '5'],
        'Y': ['6.1', '7.2', '8.3', '9.4', '10.5']}

df = pd.DataFrame(data)

# Data types of the columns
print(df.dtypes)

X    object
Y    object
dtype: object

Both columns currently have the object data type because their values are strings. We can change this with the “astype()” method.

# Convert column 'X' to integer
df['X'] = df['X'].astype(int)

# Convert column 'Y' to float
df['Y'] = df['Y'].astype(float)

# Data types of the columns
print(df.dtypes)

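The output confirms the new data types. Note that the integer width depends on the platform: “int” typically maps to int64 on Linux and macOS, while int32 appears on Windows with NumPy versions before 2.0.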

X      int32
Y    float64
dtype: object
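
As an alternative sketch, “astype()” also accepts a dictionary that maps column names to target types, so several columns can be converted in one call. Spelling out the width (“int64” here, chosen for illustration) gives the same integer type on every platform.

```python
import pandas as pd

# Same sample data as above: numbers stored as strings
df = pd.DataFrame({'X': ['1', '2', '3', '4', '5'],
                   'Y': ['6.1', '7.2', '8.3', '9.4', '10.5']})

# Convert both columns in a single call; astype returns a new dataframe
converted = df.astype({'X': 'int64', 'Y': 'float64'})

print(converted.dtypes)
```

The original dataframe “df” is left unchanged, because “astype()” returns a converted copy rather than modifying the data in place.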

This is an original data cleaning educational material created by aicorr.com.


by AICorr Team

We are proud to offer our extensive knowledge to you, for free. The AICorr Team puts a lot of effort in researching, testing, and writing the content within the platform (aicorr.com). We hope that you learn and progress forward.
