Working with Text Data
This tutorial covers working with text data in Pandas, which involves operations such as string manipulation, pattern matching, and text analysis. Pandas provides a range of methods for handling text data efficiently, all available through the .str accessor of a Series.
Within this tutorial, we cover the following methods:
- str.lower() – converts text to lowercase
- str.upper() – converts text to uppercase
- str.len() – calculates the length of each text
- str.split().str.len() – counts the number of words in each text
- str.contains() – checks if a substring is present in the text
- str.findall() – finds patterns in the text
- str.split(expand=True) – splits the text into multiple columns
- str.extract() – extracts specific substrings from the text
- str.replace() – replaces a substring with another string
- str.split().explode().value_counts() – analyses word frequency in the text
These are only some of the string methods that Pandas offers. For more, please refer to the official Pandas documentation.
First, we create some sample text data.
    import pandas as pd

    # Sample data
    data = {
        'ProductID': [1, 2],
        'Description': [
            'Check out our latest product release.',
            'Limited time offer on this product!'
        ]
    }

    df = pd.DataFrame(data)
    print(df)

       ProductID                            Description
    0          1  Check out our latest product release.
    1          2    Limited time offer on this product!
Please note that most of the examples create a separate column (feature). As a result, the dataframe gains a column with each method, and the original data stays unmodified.
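To make the note above concrete, here is a minimal sketch of the difference between adding a column and overwriting one. The names lowered and df_overwritten are illustrative only and are not used elsewhere in this tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!'
    ]
})

# .str methods return a new Series; the source column is left untouched
lowered = df['Description'].str.lower()
print(df['Description'].iloc[0])

# Storing the result under a new name adds a column
df['Description_Lower'] = lowered

# Assigning back to the same name overwrites the values instead
# (done on a copy here so that df keeps its original text)
df_overwritten = df.copy()
df_overwritten['Description'] = df_overwritten['Description'].str.lower()
print(df_overwritten['Description'].iloc[0])
```

Keeping the original column around is usually the safer choice while exploring data, since every later example can still start from the unmodified text.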
Lowercase conversion
    df['Description_Lower'] = df['Description'].str.lower()
    print(df['Description_Lower'])

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0    check out our latest product release.
    1      limited time offer on this product!
Uppercase conversion
    df['Description_Upper'] = df['Description'].str.upper()
    print(df['Description_Upper'])

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0    CHECK OUT OUR LATEST PRODUCT RELEASE.
    1      LIMITED TIME OFFER ON THIS PRODUCT!
Length calculation
    df['Description_Length'] = df['Description'].str.len()
    print(df['Description_Length'])

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0    37
    1    35
Word count
    df['Word_Count'] = df['Description'].str.split().str.len()
    print(df['Word_Count'])

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0    6
    1    6
Contains
    contains = df['Description'].str.contains('latest')
    print(contains)

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0     True
    1    False
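In practice, str.contains() is often combined with its case and na parameters and then used to filter rows. The sketch below assumes a hypothetical third product with a missing description, added purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2, 3],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!',
        None,  # hypothetical missing description, not part of the sample data
    ]
})

# case=False ignores letter case; na=False treats missing text as "no match"
mask = df['Description'].str.contains('LIMITED', case=False, na=False)

# The boolean mask can then select the matching rows
matching_rows = df[mask]
print(matching_rows)
```

Without na=False, the missing value would propagate into the mask as a missing entry, which cannot be used directly for row selection.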
Regex matching
    # Find words with exactly 5 characters
    pattern = r'\b\w{5}\b'
    matches = df['Description'].str.findall(pattern)
    print(matches)

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0    [Check]
    1    [offer]
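Since str.findall() returns a list per row, a follow-up step is often needed to get a plain value. As a rough sketch, str.count() gives the number of matches directly, and .str[0] picks the first match from each list; the variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!'
    ]
})

pattern = r'\b\w{5}\b'  # words with exactly 5 characters

# Number of matches per row, without building the lists
match_counts = df['Description'].str.count(pattern)

# First match per row, taken from the list that findall() returns
first_match = df['Description'].str.findall(pattern).str[0]

print(match_counts)
print(first_match)
```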
Splitting
    split_description = df['Description'].str.split(expand=True)
    print(split_description)

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
             0     1      2       3        4         5
    0    Check   out    our  latest  product  release.
    1  Limited  time  offer      on     this  product!
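When only the first few pieces are of interest, the n parameter of str.split() limits the number of splits. The following sketch separates the first word from the rest of the description; the column names First_Word and Rest are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!'
    ]
})

# n=1 performs at most one split, so each row yields exactly two pieces
parts = df['Description'].str.split(n=1, expand=True)
parts.columns = ['First_Word', 'Rest']  # illustrative column names
print(parts)
```

Limiting the splits also keeps the number of resulting columns predictable when the descriptions differ in length.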
Text extraction
    df['Product_Type'] = df['Description'].str.extract(r'(product \w+)')
    print(df['Product_Type'])

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0    product release
    1                NaN
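The second row yields NaN because "product!" is not followed by a space and a word, so the pattern does not match. As a small sketch of two further options, named groups produce labelled columns, and expand=False returns a Series when the pattern has a single group. The group names Keyword and Following are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!'
    ]
})

# Named groups become the column names of the resulting DataFrame
named = df['Description'].str.extract(r'(?P<Keyword>product) (?P<Following>\w+)')
print(named)

# With one capture group, expand=False returns a Series instead of a DataFrame
following = df['Description'].str.extract(r'product (\w+)', expand=False)
print(following)
```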
Replace
    df['Description_Replaced'] = df['Description'].str.replace('product', 'item')
    print(df['Description_Replaced'])

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    0    Check out our latest item release.
    1      Limited time offer on this item!
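The default for the regex parameter of str.replace() has differed between Pandas versions, so stating it explicitly avoids surprises. The sketch below shows a literal replacement next to a pattern-based one that strips a trailing full stop or exclamation mark; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!'
    ]
})

# Literal replacement: the search text is taken character for character
literal = df['Description'].str.replace('product', 'item', regex=False)

# Pattern replacement: remove a '.' or '!' at the end of the text
no_trailing_punct = df['Description'].str.replace(r'[.!]$', '', regex=True)

print(literal)
print(no_trailing_punct)
```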
Word frequency
    word_frequency = df['Description'].str.split().explode().value_counts()
    print(word_frequency)

    # Original data
    0    Check out our latest product release.
    1      Limited time offer on this product!
    # Modified
    Description
    Check       1
    out         1
    our         1
    latest      1
    product     1
    release.    1
    Limited     1
    time        1
    offer       1
    on          1
    this        1
    product!    1
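Notice that "product" and "product!" are counted as different words, because splitting on whitespace keeps punctuation and letter case. One possible way to address this, sketched here under the assumption that punctuation carries no meaning for the analysis, is to normalise the text before counting.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!'
    ]
})

words = (
    df['Description']
    .str.lower()                               # ignore letter case
    .str.replace(r'[^\w\s]', '', regex=True)   # drop punctuation
    .str.split()                               # one list of words per row
    .explode()                                 # one word per row
)

word_frequency = words.value_counts()
print(word_frequency)
```

With this normalisation, "product" is counted twice and every other word once.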
This is original educational material on working with text data, created by aicorr.com.