Data Inspection
Analysing data
This is a data inspection tutorial.
Before analysing or modifying data, it pays to inspect it first. Data inspection in Pandas involves examining the structure, content, and basic statistics of a DataFrame. This helps you understand your data and spot potential issues or patterns early. Pandas offers several common methods and attributes for inspecting a DataFrame. Let’s dive into them.
First, let’s create a random dataframe (5 columns and 10 rows). For this, we will use the combination of Pandas and NumPy. Because the values are random, your output will differ from the one shown here. If you are not familiar with NumPy, please refer to our tutorials here.
import pandas as pd
import numpy as np

# Sample DataFrame with random data
data = {
    'A': np.random.rand(10),
    'B': np.random.randint(0, 100, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10),
    'D': np.random.randn(10),
    'E': np.random.uniform(1, 10, 10)
}

df = pd.DataFrame(data)
print(df)
          A   B  C         D         E
0  0.977025  23  Z  0.032601  2.115461
1  0.700341  86  X  0.540359  8.837162
2  0.919997  11  Z  0.597639  2.506773
3  0.340381  74  X  0.335716  2.205103
4  0.006617  77  X  1.214410  8.628857
5  0.861336  38  Z -0.051425  1.277405
6  0.975448  52  Y -0.868918  7.873193
7  0.037024  42  Y -0.660432  2.030910
8  0.935821  10  Y -1.793465  5.806170
9  0.506367  14  Z  0.540431  3.938597
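If you want the same random values on every run, one option is to draw them from a seeded NumPy generator. The following is a minimal sketch rather than part of the original example: the helper name make_df and the seed value 42 are assumptions chosen for illustration, and it uses NumPy's newer Generator API instead of the np.random.* functions above.

```python
import numpy as np
import pandas as pd


def make_df(seed=42):
    """Build a 10-row, 5-column frame like the one above, reproducibly.

    make_df and the default seed are hypothetical names/values for this sketch.
    """
    rng = np.random.default_rng(seed)  # seeded generator: same seed, same data
    return pd.DataFrame({
        "A": rng.random(10),                   # floats in [0, 1)
        "B": rng.integers(0, 100, 10),         # integers in [0, 100)
        "C": rng.choice(["X", "Y", "Z"], 10),  # random labels
        "D": rng.standard_normal(10),          # standard normal samples
        "E": rng.uniform(1, 10, 10),           # floats in [1, 10)
    })


df = make_df()
print(df)

# Two frames built from the same seed hold identical values
print(make_df(7).equals(make_df(7)))
```

The exact numbers will still not match the output shown in this tutorial, since that was produced without a fixed seed.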
Basic inspection
Now, we can start with some basic inspection methods.
- head() – displays first few rows
- tail() – displays last few rows
- sample() – displays a random sample of rows
- dtypes – shows each column’s data type
- columns – shows all column names
- shape – returns the dimensions as a (rows, columns) tuple
- size – returns the total number of elements (rows × columns)
# View the first 3 rows (5 by default)
print(df.head(3))

# View the last 3 rows (5 by default)
print(df.tail(3))

# Sample 3 rows
print(df.sample(n=3))

# Get the column names
print(df.columns)

# Get data types
print(df.dtypes)

# Get dimensionality
print(df.shape)

# Get number of elements
print(df.size)
# head()
          A   B  C         D         E
0  0.977025  23  Z  0.032601  2.115461
1  0.700341  86  X  0.540359  8.837162
2  0.919997  11  Z  0.597639  2.506773

# tail()
          A   B  C         D         E
7  0.037024  42  Y -0.660432  2.030910
8  0.935821  10  Y -1.793465  5.806170
9  0.506367  14  Z  0.540431  3.938597

# sample()
          A   B  C         D         E
7  0.037024  42  Y -0.660432  2.030910
6  0.975448  52  Y -0.868918  7.873193
4  0.006617  77  X  1.214410  8.628857

# columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

# dtypes
A    float64
B      int32
C     object
D    float64
E    float64
dtype: object

# shape
(10, 5)

# size
50
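A few of these attributes combine naturally. The sketch below is illustrative only: it builds its own small stand-in frame, and the random_state argument and the select_dtypes() call are additions not used in the original example. It shows that size equals rows times columns, that a fixed random_state makes sample() repeatable, and that columns can be picked by data type.

```python
import numpy as np
import pandas as pd

# Stand-in frame for this sketch (the seed 0 is an arbitrary assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.random(10),
    "B": rng.integers(0, 100, 10),
    "C": rng.choice(["X", "Y", "Z"], 10),
})

# shape is a (rows, columns) tuple; size is their product
rows, cols = df.shape
print(rows, cols, df.size)  # 10 3 30

# A fixed random_state makes sample() return the same rows each time
first_pick = df.sample(n=3, random_state=1)
second_pick = df.sample(n=3, random_state=1)
print(first_pick.index.equals(second_pick.index))  # True

# dtypes tells you each column's type; select_dtypes() filters on it
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['A', 'B']
```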
Further inspection
- value_counts() – counts how often each distinct value occurs in a column
- nunique() – returns the number of distinct values in each column
- isnull() – flags missing (NULL/NaN) values with True
- info() – generates detailed information about an object
- describe() – generates descriptive statistics about an object
# Get value counts for a specific column
print(df['A'].value_counts())

# Count the unique values in each column
print(df.nunique())

# Show NULL values
print(df.isnull())

# Details about the object
# (info() prints its report itself and returns None,
# so wrapping it in print() also outputs "None")
print(df.info())

# Descriptive statistics about the object
print(df.describe())
# value_counts()
A
0.977025    1
0.700341    1
0.919997    1
0.340381    1
0.006617    1
0.861336    1
0.975448    1
0.037024    1
0.935821    1
0.506367    1
Name: count, dtype: int64

# nunique()
A    10
B    10
C     3
D    10
E    10
dtype: int64

# isnull()
       A      B      C      D      E
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False  False  False  False
3  False  False  False  False  False
4  False  False  False  False  False
5  False  False  False  False  False
6  False  False  False  False  False
7  False  False  False  False  False
8  False  False  False  False  False
9  False  False  False  False  False

# info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       10 non-null     float64
 1   B       10 non-null     int32
 2   C       10 non-null     object
 3   D       10 non-null     float64
 4   E       10 non-null     float64
dtypes: float64(3), int32(1), object(1)
memory usage: 488.0+ bytes
None

# describe()
               A          B          D          E
count  10.000000  10.000000  10.000000  10.000000
mean    0.626036  42.700000  -0.011308   4.521963
std     0.382257  28.724941   0.878057   2.993704
min     0.006617  10.000000  -1.793465   1.277405
25%     0.381878  16.250000  -0.508180   2.137872
50%     0.780839  40.000000   0.184159   3.222685
75%     0.931865  68.500000   0.540413   7.356437
max     0.977025  86.000000   1.214410   8.837162
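The random frame above has no missing values and its float column has no repeated values, so isnull() and value_counts() have little to show. As a hedged illustration, the sketch below uses a small hand-made frame (its values are invented for this example) with a few gaps and a repeated label. Chaining sum() onto isnull() and passing normalize=True or include="all" are common idioms, not something the original example uses.

```python
import numpy as np
import pandas as pd

# Hand-made frame for this sketch: two gaps in 'A', one in 'C'
df = pd.DataFrame({
    "A": [0.5, np.nan, 0.1, 0.9, np.nan],
    "C": ["X", "Y", "X", None, "X"],
})

# isnull() gives a True/False table; summing it counts missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)  # A: 2, C: 1

# value_counts() is most useful on a column with repeated values;
# missing entries are left out by default
label_counts = df["C"].value_counts()
print(label_counts)  # X: 3, Y: 1

# normalize=True returns shares instead of raw counts
label_share = df["C"].value_counts(normalize=True)
print(label_share)  # X: 0.75, Y: 0.25

# describe() skips non-numeric columns unless asked to include them
print(df.describe(include="all"))
```

On a large DataFrame, the per-column totals from isnull().sum() are usually easier to read than the full True/False table.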
This is an original data inspection educational material created by aicorr.com.