Pandas

Data Loading and Handling

Importing data

This is a data loading and handling tutorial.

Importing data with Pandas is a very common process, especially in data manipulation and analysis. Data comes from various format sources, from CSV to JSON files. Pandas offers very convenient methods for dealing with different data imports. Let’s explore the most common ways.

We have created sample files for each format, named “test.extension“.

CSV files

import pandas as pd

# Read CSV file
df_csv = pd.read_csv('test.csv')

# Display first few rows
print(df_csv.head())
   Index Column 1 Column 2
0      0     data     data
1      1     data     data

Excel files – this method requires openpyxl (pip install openpyxl)

import pandas as pd

# Read Excel file
df_excel = pd.read_excel('test.xlsx', sheet_name='Sheet1')

# Display first few rows
print(df_excel.head())
   Index Column 1 Column 2
0      0     data     data
1      1     data     data

JSON files

import pandas as pd

# Read JSON file
df = pd.read_json('test.json')

# Display entire file
print(df) 
   Index Column 1 Column 2
0      0     data     data
1      1     data     data

Handling missing data

In real-world scenarios, missing data is a must. Perfection does not exist when it comes to data, especially with large datasets. Very often, some data is missing. Pandas provides easy and efficient methods for dealing with such data. The process of handling missing data involves detection, removal, and imputation. Imputation simply refers to the act of replacing (or filling) the missing values. There are various ways imputing values. In this section, we cover some basic techniques.

First, let’s create a sample dataset (with some missing values). The keyword None creates Null values (or no values).

import pandas as pd

# DataFrame with missing values
data = {'X': [3, 2, None, 3, 2],
        'Y': [5, None, 9, 6, 7],
        'Z': [12, 11, None, None, 12]}

df = pd.DataFrame(data)

print(df)

Dataset: 3 features (columns) and 5 records (rows).

     W    X     Y
0  3.0  5.0  12.0
1  2.0  NaN  11.0
2  NaN  9.0   NaN
3  3.0  6.0   NaN
4  2.0  7.0  12.0

In Pandas, NaN stands for Not a Number values.

Detection

The first step of handling missing data is detection. Pandas provides the “isnull()” function, which simply detects Null values by outputting True, or otherwise False if value exists.

# Detect missing values
missing_values = df.isnull()

print(missing_values)
       W      X      Y
0  False  False  False
1  False   True  False
2   True  False   True
3  False  False   True
4  False  False  False

Removal

In some cases, instead of filling missing data, it may be best to simply remove it. Pandas can remove values by rows, columns, or all. The following are the different methods of removing data. Pandas use the function “dropna()“.

Remove rows with any missing values (default)

# Remove rows with any missing values (default)
df_removed = df.dropna()

print(df_removed)
     W    X     Y
0  3.0  5.0  12.0
4  2.0  7.0  12.0

Remove columns with any missing values

# Remove columns with any missing values
df_removed = df.dropna(axis=1)

print(df_removed)
# All columns have empty values - dataframe becomes empty
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

Delete rows where all values are missing

# Remove rows where all values are missing
df_removed = df.dropna(how='all')

print(df_removed)
# Datafame does not have any rows with ALL values missing
     W    X     Y
0  3.0  5.0  12.0
1  2.0  NaN  11.0
2  NaN  9.0   NaN
3  3.0  6.0   NaN
4  2.0  7.0  12.0

Remove rows with missing values in specific columns

# Remove rows with missing values in specific columns
df_removed = df.dropna(subset=['X'])

print(df_removed)
     W    X     Y
0  3.0  5.0  12.0
2  NaN  9.0   NaN
3  3.0  6.0   NaN
4  2.0  7.0  12.0

Imputation

In other scenarios, instead of removing missing values, imputation can provide effective ways of filling such data. There are various techniques for filling missing data. Here, we cover some simple methods. Pandas’s function for filling values is “fillna()“.

Reminder of the below basics:

  • Constant – a number that does not change and continues the same
  • Mean – the average number of a collection of numbers
  • Median – the middle number separating the lower and higher half of a collection of numbers
  • Mode – the number appearing most frequently in a collection of numbers

Impute missing values with a constant (0 in this case)

# Impute missing values with a constant (0 in this case)
df_imputed_constant = df.fillna(0) 

print(df_imputed_constant)
     W    X     Y
0  3.0  5.0  12.0
1  2.0  0.0  11.0
2  0.0  9.0   0.0
3  3.0  6.0   0.0
4  2.0  7.0  12.0

Impute missing values with mean (mean of column X in this case)

# Impute missing values with mean (mean of column X)
mean_value = df['X'].mean()
df_imputed_mean = df.fillna(mean_value)  

print(df_imputed_mean)
      W     X      Y
0  3.00  5.00  12.00
1  2.00  6.75  11.00
2  6.75  9.00   6.75
3  3.00  6.00   6.75
4  2.00  7.00  12.00

Fill missing values with median (median of column X in this case)

# Impute missing values with median (median of column X)
median_value = df['X'].median()
df_imputed_median = df.fillna(median_value)

print(df_imputed_median)
     W    X     Y
0  3.0  5.0  12.0
1  2.0  6.5  11.0
2  6.5  9.0   6.5
3  3.0  6.0   6.5
4  2.0  7.0  12.0

Impute missing values with mode (mode of column X in this case)

# Impute missing values with mode (mode of column X)
mode_value = df['X'].mode()[0] 
df_imputed_mode = df.fillna(mode_value)

print(df_imputed_mode)
     W    X     Y
0  3.0  5.0  12.0
1  2.0  5.0  11.0
2  5.0  9.0   5.0
3  3.0  6.0   5.0
4  2.0  7.0  12.0

This is an original data loading and handling educational material created by aicorr.com.

Next: Data Cleaning

by AICorr Team

We are proud to offer our extensive knowledge to you, for free. The AICorr Team puts a lot of effort in researching, testing, and writing the content within the platform (aicorr.com). We hope that you learn and progress forward.