Data Cleaning
This is a data cleaning tutorial.
Data cleaning is a crucial step in data analysis because it helps ensure accurate and reliable results. Pandas provides several techniques for cleaning data. This tutorial covers two of them: removing duplicates and changing data types (type conversion).
Removing duplicates
To deal with duplicated values, Pandas offers the method “duplicated()”, which identifies duplicate rows within a dataframe. Once the duplicates are identified, we can drop (remove) them with the method “drop_duplicates()”. Let’s see how both methods work.
First, we create a sample dataframe (including duplicates).
import pandas as pd

# Sample DataFrame (with duplicated values)
data = {'Name': ['Ben', 'Samantha', 'Ben', 'Aleksandra'],
        'Age': [22, 38, 22, 18]}
df = pd.DataFrame(data)

print(df)

         Name  Age
0         Ben   22
1    Samantha   38
2         Ben   22
3  Aleksandra   18
Now, apply “duplicated()”. It returns a boolean Series in which True marks a row that repeats an earlier row.
duplicates = df.duplicated()

print(duplicates)

0    False
1    False
2     True
3    False
dtype: bool
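Beyond printing the boolean Series, it is often useful to count the duplicates or to look at the affected rows themselves. The following is a minimal sketch using the same sample data; the variable names “duplicate_count” and “all_copies” are illustrative and not part of the original tutorial.

```python
import pandas as pd

# Same sample data as in the tutorial
data = {'Name': ['Ben', 'Samantha', 'Ben', 'Aleksandra'],
        'Age': [22, 38, 22, 18]}
df = pd.DataFrame(data)

# True counts as 1, so summing the boolean Series counts the repeated rows
duplicate_count = df.duplicated().sum()
print(duplicate_count)

# keep=False flags every member of a duplicate group,
# including the first occurrence, so all copies can be inspected
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```

With “keep=False”, both rows for Ben (index 0 and index 2) are selected, which helps when you want to review the duplicates before deciding what to drop.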
Let’s remove all duplicates. By default, “drop_duplicates()” compares all columns and keeps the first occurrence of each duplicated row.
no_duplicates = df.drop_duplicates()

print(no_duplicates)

         Name  Age
0         Ben   22
1    Samantha   38
3  Aleksandra   18
The third row (index 2) has been removed. To compare only certain columns, use the “subset” argument. For example, “no_duplicates = df.drop_duplicates(subset=['Name', 'Age'])”.
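The effect of “subset” is easier to see when rows match in one column but not in another. The sketch below uses hypothetical data (Ben appears twice with different ages, which is not the tutorial’s dataset) and also shows the “keep” argument, which controls which occurrence survives.

```python
import pandas as pd

# Hypothetical data: 'Ben' appears twice, but with different ages
data = {'Name': ['Ben', 'Samantha', 'Ben', 'Aleksandra'],
        'Age': [22, 38, 23, 18]}
people = pd.DataFrame(data)

# No row is an exact copy of another, so nothing is dropped
unchanged = people.drop_duplicates()

# Compare on 'Name' only; the first occurrence is kept by default
first_by_name = people.drop_duplicates(subset=['Name'])

# keep='last' keeps the last occurrence instead
last_by_name = people.drop_duplicates(subset=['Name'], keep='last')

print(first_by_name)
print(last_by_name)
```

Here “first_by_name” keeps the row with index 0 for Ben, while “last_by_name” keeps the row with index 2.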
Changing data types
Changing a data type, also called type conversion, means converting the values of a dataframe column from one type to another, for instance converting an age column from integers to floats. Pandas provides a convenient method for this, “astype()”.
Keep in mind that a conversion fails with an error if a value cannot be converted to the target type, so it’s important to handle such cases.
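One possible way to handle such errors is sketched below. It assumes a hypothetical column containing one non-numeric value; the data and variable names are illustrative. The sketch first catches the error raised by “astype()”, then uses “pd.to_numeric()” with “errors='coerce'” as an alternative that turns unparseable values into NaN.

```python
import pandas as pd

# Hypothetical column with one value that is not a number
values = pd.Series(['1', '2', 'three', '4'])

# astype() raises a ValueError when a value cannot be converted
try:
    values.astype(int)
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print(f"Conversion failed: {error}")

# errors='coerce' replaces unparseable values with NaN instead of raising
cleaned = pd.to_numeric(values, errors='coerce')
print(cleaned)
```

Because NaN is a floating-point value, the coerced result is a float column; the missing entries can then be inspected, filled, or dropped before any further conversion.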
First, let’s create a sample dataframe and check the data type of all columns with “dtypes”.
import pandas as pd

# Sample DataFrame
data = {'X': ['1', '2', '3', '4', '5'],
        'Y': ['6.1', '7.2', '8.3', '9.4', '10.5']}
df = pd.DataFrame(data)

# Data types of the columns
print(df.dtypes)

X    object
Y    object
dtype: object
Both columns currently have the data type object, because their values are stored as strings. We can change that with the “astype()” method.
# Convert column 'X' to integer
df['X'] = df['X'].astype(int)

# Convert column 'Y' to float
df['Y'] = df['Y'].astype(float)

# Data types of the columns
print(df.dtypes)
The output shows the new data types. The integer width (int32 or int64) depends on your platform and NumPy version.

X      int32
Y    float64
dtype: object
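Several columns can also be converted in a single call by passing “astype()” a dictionary that maps column names to target types. The sketch below reuses the sample data; naming the width explicitly ('int64') is an assumption about what you want, and the variable name “converted” is illustrative.

```python
import pandas as pd

# Same sample data as in the tutorial
data = {'X': ['1', '2', '3', '4', '5'],
        'Y': ['6.1', '7.2', '8.3', '9.4', '10.5']}
df = pd.DataFrame(data)

# Map each column name to its target type; astype() returns a new DataFrame
converted = df.astype({'X': 'int64', 'Y': 'float64'})

print(converted.dtypes)
```

Since “astype()” returns a new dataframe, the original “df” is left unchanged unless you assign the result back to it.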
This is an original data cleaning educational material created by aicorr.com.