Data Inspection
This is a data inspection tutorial.
When aiming to analyse or modify data, it is always beneficial to inspect it first. Data inspection in Pandas involves examining the structure, content, and basic statistics of a DataFrame to gain insights into the data. As a result, you can better understand your data and identify potential issues or patterns. Pandas offers several common techniques for inspecting information. Let’s dive into them.
First, let’s create a DataFrame of random values (5 columns and 10 rows). For this, we will use Pandas together with NumPy. Because the values are random, your output will differ from the one shown here. If you are not familiar with NumPy, please refer to our tutorials here.
import pandas as pd
import numpy as np
# Sample DataFrame with random data
data = {
    'A': np.random.rand(10),
    'B': np.random.randint(0, 100, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10),
    'D': np.random.randn(10),
    'E': np.random.uniform(1, 10, 10)
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 0.977025 23 Z 0.032601 2.115461
1 0.700341 86 X 0.540359 8.837162
2 0.919997 11 Z 0.597639 2.506773
3 0.340381 74 X 0.335716 2.205103
4 0.006617 77 X 1.214410 8.628857
5 0.861336 38 Z -0.051425 1.277405
6 0.975448 52 Y -0.868918 7.873193
7 0.037024 42 Y -0.660432 2.030910
8 0.935821 10 Y -1.793465 5.806170
9 0.506367 14 Z 0.540431 3.938597
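Since the sample data is random, every run produces a different table. If you want to reproduce the same values each time, one option is a seeded NumPy generator. The sketch below is an assumption-laden variant of the example above, not part of the original tutorial: the seed value 42 and the name df_seeded are arbitrary choices, and the numbers it produces will not match the output printed in this tutorial.

```python
import numpy as np
import pandas as pd

# A seeded generator yields the same "random" values on every run.
# The seed (42) is an arbitrary, illustrative choice.
rng = np.random.default_rng(seed=42)

data = {
    'A': rng.random(10),                         # floats in [0, 1)
    'B': rng.integers(0, 100, size=10),          # integers in [0, 100)
    'C': rng.choice(['X', 'Y', 'Z'], size=10),   # random labels
    'D': rng.standard_normal(10),                # standard normal values
    'E': rng.uniform(1, 10, size=10),            # floats in [1, 10)
}
df_seeded = pd.DataFrame(data)
print(df_seeded)
```

Rerunning this script prints the same DataFrame each time, which makes it easier to compare your results with someone else's.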
Basic inspection
Now, we can start with some basic inspection methods.
- head() – displays first few rows
- tail() – displays last few rows
- sample() – displays random samples
- dtypes – shows each column’s data type
- columns – shows all column names
- shape – returns the dimensions of an object as a (rows, columns) tuple
- size – returns the total number of elements in an object (rows × columns)
# View the first 3 rows (5 by default)
print(df.head(3))
# View the last 3 rows (5 by default)
print(df.tail(3))
# Sample 3 random rows
print(df.sample(n=3))
# Get data types
print(df.dtypes)
# Get the column names
print(df.columns)
# Get dimensionality
print(df.shape)
# Get number of elements
print(df.size)
# head()
A B C D E
0 0.977025 23 Z 0.032601 2.115461
1 0.700341 86 X 0.540359 8.837162
2 0.919997 11 Z 0.597639 2.506773
# tail()
A B C D E
7 0.037024 42 Y -0.660432 2.030910
8 0.935821 10 Y -1.793465 5.806170
9 0.506367 14 Z 0.540431 3.938597
# sample()
A B C D E
7 0.037024 42 Y -0.660432 2.030910
6 0.975448 52 Y -0.868918 7.873193
4 0.006617 77 X 1.214410 8.628857
# dtypes
A float64
B int32
C object
D float64
E float64
dtype: object
# columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
# shape
(10, 5)
# size
50
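The relationship between shape, size and len() can be checked directly. The following is a minimal sketch on a small hypothetical DataFrame (df_demo and its values are made up for illustration). It also shows the random_state parameter of sample(), which makes the "random" selection repeatable.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (hypothetical values): 10 rows, 2 columns
df_demo = pd.DataFrame({
    'A': np.arange(10, dtype=float),
    'C': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X'],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# len() returns the number of rows only
print(len(df_demo))

# size is always rows * columns
print(df_demo.size == n_rows * n_cols)

# With a fixed random_state, sample() returns the same rows on every call
first_pick = df_demo.sample(n=3, random_state=0)
second_pick = df_demo.sample(n=3, random_state=0)
print(first_pick.equals(second_pick))
```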
Further inspection
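- value_counts() – counts how often each distinct value occurs in a column
- nunique() – counts the number of unique values in each column
- isnull() – flags missing (NULL/NaN) values with True
- info() – prints a summary of an object: index, columns, non-null counts, data types and memory usage
- describe() – generates descriptive statistics about an object (numeric columns only by default)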
# Get value counts for a specific column
print(df['A'].value_counts())
# Count unique values in each column
print(df.nunique())
# Show NULL values
print(df.isnull())
# Details about the object
# (info() prints its report itself and returns None, so print() adds a trailing "None")
print(df.info())
# Descriptive statistics about the object
print(df.describe())
# value_counts()
A
0.977025 1
0.700341 1
0.919997 1
0.340381 1
0.006617 1
0.861336 1
0.975448 1
0.037024 1
0.935821 1
0.506367 1
Name: count, dtype: int64
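On a column of continuous random floats such as A, every value is unique, so each count is 1. value_counts() tends to be more informative on a categorical column. The sketch below uses the ten labels of column C copied from the sample output at the top of this tutorial. The variable names are illustrative, and normalize=True is an optional parameter that returns proportions instead of counts.

```python
import pandas as pd

# Labels copied from column C of the sample DataFrame shown earlier
labels = pd.Series(['Z', 'X', 'Z', 'X', 'X', 'Z', 'Y', 'Y', 'Y', 'Z'], name='C')

# Absolute frequency of each label
counts = labels.value_counts()
print(counts)

# Relative frequency (each count divided by the number of rows)
shares = labels.value_counts(normalize=True)
print(shares)
```

Here Z appears four times, while X and Y appear three times each, which is consistent with nunique() reporting 3 distinct values for C.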
# nunique()
A 10
B 10
C 3
D 10
E 10
dtype: int64
# isnull()
A B C D E
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
5 False False False False False
6 False False False False False
7 False False False False False
8 False False False False False
9 False False False False False
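A full True/False table is hard to read on larger DataFrames, so a common pattern is to chain sum() after isnull() and get one missing-value count per column. The sample data in this tutorial has no missing values, so the sketch below builds a small hypothetical DataFrame (df_missing) with a few gaps to make the result visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data with deliberately missing entries
df_missing = pd.DataFrame({
    'A': [0.5, np.nan, 0.1, 0.9],
    'C': ['X', None, 'Z', None],
})

# True counts as 1 when summed, so this gives missing values per column
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)

# Single yes/no answer: does the DataFrame contain any missing value at all?
any_missing = bool(df_missing.isnull().any().any())
print(any_missing)
```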
# info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 10 non-null float64
1 B 10 non-null int32
2 C 10 non-null object
3 D 10 non-null float64
4 E 10 non-null float64
dtypes: float64(3), int32(1), object(1)
memory usage: 488.0+ bytes
None
# describe()
A B D E
count 10.000000 10.000000 10.000000 10.000000
mean 0.626036 42.700000 -0.011308 4.521963
std 0.382257 28.724941 0.878057 2.993704
min 0.006617 10.000000 -1.793465 1.277405
25% 0.381878 16.250000 -0.508180 2.137872
50% 0.780839 40.000000 0.184159 3.222685
75% 0.931865 68.500000 0.540413 7.356437
max 0.977025 86.000000 1.214410 8.837162
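Note that column C is absent from the describe() output above, because only numeric columns are summarised by default. As a rough sketch, a text column can be described on its own, or include='all' can be passed to cover every column. The DataFrame df_mixed below is a small hypothetical example, not the tutorial's data.

```python
import pandas as pd

# Hypothetical DataFrame with one numeric and one text column
df_mixed = pd.DataFrame({
    'B': [23, 86, 11, 74],
    'C': ['Z', 'X', 'Z', 'X'],
})

# Default behaviour: only the numeric column B is summarised
numeric_summary = df_mixed.describe()
print(numeric_summary)

# Describing a text column returns count, unique, top and freq instead
text_summary = df_mixed['C'].describe()
print(text_summary)

# include='all' covers every column; statistics that do not apply show as NaN
full_summary = df_mixed.describe(include='all')
print(full_summary)
```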
This is an original data inspection educational material created by aicorr.com.

