Data Loading and Handling
Importing data
This is a data loading and handling tutorial.
Importing data with Pandas is a very common process, especially in data manipulation and analysis. Data comes from various format sources, from CSV to JSON files. Pandas offers very convenient methods for dealing with different data imports. Let’s explore the most common ways.
We have created sample files for each format, named “test.extension“.
CSV files
import pandas as pd # Read CSV file df_csv = pd.read_csv('test.csv') # Display first few rows print(df_csv.head())
Index Column 1 Column 2 0 0 data data 1 1 data data
Excel files – this method requires openpyxl (pip install openpyxl)
import pandas as pd # Read Excel file df_excel = pd.read_excel('test.xlsx', sheet_name='Sheet1') # Display first few rows print(df_excel.head())
Index Column 1 Column 2 0 0 data data 1 1 data data
JSON files
import pandas as pd # Read JSON file df = pd.read_json('test.json') # Display entire file print(df)
Index Column 1 Column 2 0 0 data data 1 1 data data
Handling missing data
In real-world scenarios, missing data is a must. Perfection does not exist when it comes to data, especially with large datasets. Very often, some data is missing. Pandas provides easy and efficient methods for dealing with such data. The process of handling missing data involves detection, removal, and imputation. Imputation simply refers to the act of replacing (or filling) the missing values. There are various ways imputing values. In this section, we cover some basic techniques.
First, let’s create a sample dataset (with some missing values). The keyword None creates Null values (or no values).
import pandas as pd # DataFrame with missing values data = {'X': [3, 2, None, 3, 2], 'Y': [5, None, 9, 6, 7], 'Z': [12, 11, None, None, 12]} df = pd.DataFrame(data) print(df)
Dataset: 3 features (columns) and 5 records (rows).
W X Y 0 3.0 5.0 12.0 1 2.0 NaN 11.0 2 NaN 9.0 NaN 3 3.0 6.0 NaN 4 2.0 7.0 12.0
In Pandas, NaN stands for Not a Number values.
Detection
The first step of handling missing data is detection. Pandas provides the “isnull()” function, which simply detects Null values by outputting True, or otherwise False if value exists.
# Detect missing values missing_values = df.isnull() print(missing_values)
W X Y 0 False False False 1 False True False 2 True False True 3 False False True 4 False False False
Removal
In some cases, instead of filling missing data, it may be best to simply remove it. Pandas can remove values by rows, columns, or all. The following are the different methods of removing data. Pandas use the function “dropna()“.
Remove rows with any missing values (default)
# Remove rows with any missing values (default) df_removed = df.dropna() print(df_removed)
W X Y 0 3.0 5.0 12.0 4 2.0 7.0 12.0
Remove columns with any missing values
# Remove columns with any missing values df_removed = df.dropna(axis=1) print(df_removed)
# All columns have empty values - dataframe becomes empty Empty DataFrame Columns: [] Index: [0, 1, 2, 3, 4]
Delete rows where all values are missing
# Remove rows where all values are missing df_removed = df.dropna(how='all') print(df_removed)
# Datafame does not have any rows with ALL values missing W X Y 0 3.0 5.0 12.0 1 2.0 NaN 11.0 2 NaN 9.0 NaN 3 3.0 6.0 NaN 4 2.0 7.0 12.0
Remove rows with missing values in specific columns
# Remove rows with missing values in specific columns df_removed = df.dropna(subset=['X']) print(df_removed)
W X Y 0 3.0 5.0 12.0 2 NaN 9.0 NaN 3 3.0 6.0 NaN 4 2.0 7.0 12.0
Imputation
In other scenarios, instead of removing missing values, imputation can provide effective ways of filling such data. There are various techniques for filling missing data. Here, we cover some simple methods. Pandas’s function for filling values is “fillna()“.
Reminder of the below basics:
- Constant – a number that does not change and continues the same
- Mean – the average number of a collection of numbers
- Median – the middle number separating the lower and higher half of a collection of numbers
- Mode – the number appearing most frequently in a collection of numbers
Impute missing values with a constant (0 in this case)
# Impute missing values with a constant (0 in this case) df_imputed_constant = df.fillna(0) print(df_imputed_constant)
W X Y 0 3.0 5.0 12.0 1 2.0 0.0 11.0 2 0.0 9.0 0.0 3 3.0 6.0 0.0 4 2.0 7.0 12.0
Impute missing values with mean (mean of column X in this case)
# Impute missing values with mean (mean of column X) mean_value = df['X'].mean() df_imputed_mean = df.fillna(mean_value) print(df_imputed_mean)
W X Y 0 3.00 5.00 12.00 1 2.00 6.75 11.00 2 6.75 9.00 6.75 3 3.00 6.00 6.75 4 2.00 7.00 12.00
Fill missing values with median (median of column X in this case)
# Impute missing values with median (median of column X) median_value = df['X'].median() df_imputed_median = df.fillna(median_value) print(df_imputed_median)
W X Y 0 3.0 5.0 12.0 1 2.0 6.5 11.0 2 6.5 9.0 6.5 3 3.0 6.0 6.5 4 2.0 7.0 12.0
Impute missing values with mode (mode of column X in this case)
# Impute missing values with mode (mode of column X) mode_value = df['X'].mode()[0] df_imputed_mode = df.fillna(mode_value) print(df_imputed_mode)
W X Y 0 3.0 5.0 12.0 1 2.0 5.0 11.0 2 5.0 9.0 5.0 3 3.0 6.0 5.0 4 2.0 7.0 12.0
This is an original data loading and handling educational material created by aicorr.com.
Next: Data Cleaning