Pandas

Advanced Data Operations


Merging dataframes

This is an advanced data operations tutorial.

Merging (joining) DataFrames in Pandas allows you to combine data from different sources based on one or more common keys. There are four main different joins: inner, left, right, outer. Pandas implement these operations through the “how” and “on” parameters.

  • Inner join – only includes data with matching keys in both dataframes
  • Left join – includes all data from the left dataframe and all matching keys from the right dataframe
  • Right join – includes all data from the right dataframe and all matching keys from the left dataframe
  • Outer (full) join – includes all data from both dataframes

For better comprehension, below is a basic representation of all joins.

SQL-inner-join-left-outer-join-right-outer-join-full-outer-join

Let’s try an example, with the inner join.

We have two dataframes, one containing information about students and their test scores, and another containing information about students’ demographics.

import pandas as pd

# Sample dataframe 1
df_scores = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'test_score': [85, 90, 88, 92, 95]
})

# Sample dataframe 2
df_demographics = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 6],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [20, 21, 20, 22, 23],
    'gender': ['F', 'M', 'M', 'M', 'F']
})

Now, let’s merge these dataframes based on the ‘student_id‘ column.

# Inner Join
merged_inner = pd.merge(df_scores, df_demographics, on='student_id', how='inner')
print(merged_inner)
   student_id  test_score     name  age gender
0           1          85    Alice   20      F
1           2          90      Bob   21      M
2           3          88  Charlie   20      M
3           4          92    David   22      M

In this inner join, only students with matching IDs in both dataframes are included (1 – 4 student ids). Student id 5 and 6 do not appear in both dataframes. Therefore, they do not appear in the merge.

Pivot and reshaping dataframes

Pivot tables and reshaping data are powerful techniques in Pandas. They transform data into a more structured form for analysis. Let’s work with an example.

Firstly, we create a sample data (sales data).

import pandas as pd

# Sample data
data = {
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 200, 150, 250, 120]
}

df = pd.DataFrame(data)
print(df)
         Date Product  Sales
0  2024-01-01       A    100
1  2024-01-01       B    200
2  2024-01-02       A    150
3  2024-01-02       B    250
4  2024-01-03       A    120

Secondly, we create a pivot table to summarise sales data by date and product.

pivot_table = df.pivot_table(index='Date', columns='Product', values='Sales', aggfunc='sum')
print(pivot_table)
Product         A      B
Date                    
2024-01-01  100.0  200.0
2024-01-02  150.0  250.0
2024-01-03  120.0    NaN

In this pivot table, each row represents a date, each column represents a product, and the values are the total sales.

And finally, we can also reshape the data using stack() and unstack() to convert between long and wide formats. As a result, this can help with better visualisation or further analysis.

# Wide format
wide_format = pivot_table.unstack().reset_index(name='Sales')
print(wide_format)

# Long format
long_format = wide_format.pivot(index='Date', columns='Product', values='Sales')
print(long_format)
  Product        Date  Sales
0       A  2024-01-01  100.0
1       A  2024-01-02  150.0
2       A  2024-01-03  120.0
3       B  2024-01-01  200.0
4       B  2024-01-02  250.0
5       B  2024-01-03    NaN


Product         A      B
Date                    
2024-01-01  100.0  200.0
2024-01-02  150.0  250.0
2024-01-03  120.0    NaN

This is an original advanced data operations educational material created by aicorr.com.

Next: Data Visualisation with Pandas