Working with Text Data


This tutorial covers working with text data in Pandas.

Working with text data in Pandas involves various operations such as string manipulation, pattern matching, and text analysis. Pandas provides a range of methods for handling text data efficiently.

Within this tutorial, we cover the following methods:

str.lower() – converts text to lowercase
str.upper() – converts text to uppercase
str.len() – calculates the length of each text
str.split().str.len() – counts the number of words in each text
str.contains() – checks if a substring is present in the text
str.findall() – finds patterns in the text
str.split(expand=True) – splits the text into multiple columns
str.extract() – extracts specific substrings from the text
str.replace() – replaces a substring with another string
str.split().explode().value_counts() – analyses word frequency in the text

These are only some of the string methods that Pandas provides. For the full list, please refer to the official Pandas documentation.

First, we create some sample text data.

import pandas as pd

# Sample data
data = {
    'ProductID': [1, 2],
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!'
    ]
}

df = pd.DataFrame(data)
print(df)

   ProductID                            Description
0          1  Check out our latest product release.
1          2    Limited time offer on this product!

Please note that most of the examples store their result in a separate column (feature). As a result, the dataframe expands with each method, and the original data is left unmodified.
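The point about new columns can be checked directly. The following is a minimal sketch, assuming a one-row dataframe; the variable names before and lowered are only illustrative. It shows that a .str method returns a new Series and that the source column changes only when the result is assigned back to it.

```python
import pandas as pd

df = pd.DataFrame({'Description': ['Check out our latest product release.']})

# .str methods return a new Series; the source column is untouched
lowered = df['Description'].str.lower()
before = df['Description'].iloc[0]
print(before)           # still mixed case
print(lowered.iloc[0])  # lowercase copy

# Assigning the result back to the same column overwrites it instead
df['Description'] = df['Description'].str.lower()
print(df['Description'].iloc[0])
```

Adding a new column keeps the original text available for later steps, which is why the examples in this tutorial follow that approach.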

Lowercase conversion

df['Description_Lower'] = df['Description'].str.lower()
print(df['Description_Lower'])

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0    check out our latest product release.
1      limited time offer on this product!

Uppercase conversion

df['Description_Upper'] = df['Description'].str.upper()
print(df['Description_Upper'])

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0    CHECK OUT OUR LATEST PRODUCT RELEASE.
1      LIMITED TIME OFFER ON THIS PRODUCT!
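A common reason for converting case is to make comparisons case-insensitive. The sketch below uses a made-up Series named labels (not part of the sample data) to show how normalising case collapses spelling variants into one value.

```python
import pandas as pd

# Hypothetical labels that differ only in letter case
labels = pd.Series(['Product', 'PRODUCT', 'product'])

distinct_raw = labels.nunique()                     # each variant counts separately
distinct_normalized = labels.str.lower().nunique()  # variants collapse after lowercasing

print(distinct_raw, distinct_normalized)
```

Either str.lower() or str.upper() works for this purpose, as long as the same one is applied to both sides of a comparison.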

Length calculation

df['Description_Length'] = df['Description'].str.len()
print(df['Description_Length'])

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0    37
1    35
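Note that str.len() counts every character, including spaces and punctuation. As a small illustration, assuming a made-up Series named raw with padded text, str.strip() can be applied first when leading and trailing whitespace should not count.

```python
import pandas as pd

# Hypothetical values; the first one has two spaces on each side
raw = pd.Series(['  padded  ', 'plain'])

raw_length = raw.str.len()                  # whitespace included
trimmed_length = raw.str.strip().str.len()  # whitespace removed first

print(raw_length.tolist())      # [10, 5]
print(trimmed_length.tolist())  # [6, 5]
```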

Word count

df['Word_Count'] = df['Description'].str.split().str.len()
print(df['Word_Count'])

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0    6
1    6
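Real data often contains missing descriptions, and the word count for those rows comes back as NaN. The sketch below assumes a made-up Series with one missing value and shows one possible way to treat missing text as zero words.

```python
import pandas as pd

# Hypothetical data with a missing description
descriptions = pd.Series(['Limited time offer', None])

# Missing rows propagate as NaN, so fill them before converting to integers
word_count = descriptions.str.split().str.len().fillna(0).astype(int)

print(word_count.tolist())  # [3, 0]
```

Whether zero is the right replacement depends on the analysis; leaving the value as NaN is equally valid when missing text should stay visible.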

Contains

contains = df['Description'].str.contains('latest')
print(contains)

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0     True
1    False
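By default, str.contains() is case-sensitive and returns a missing value for missing text. The following sketch, which uses a made-up Series with an uppercase variant and a missing entry, shows the case and na parameters handling both situations.

```python
import pandas as pd

# Hypothetical data: an uppercase variant and a missing value
s = pd.Series([
    'Check out our latest product release.',
    'LATEST deals inside',
    None,
])

# case=False ignores letter case; na=False treats missing text as "no match"
mask = s.str.contains('latest', case=False, na=False)

print(mask.tolist())  # [True, True, False]
```

A mask without missing values can be used directly for filtering, for example s[mask].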

Regex matching

# Find words with exactly 5 characters
pattern = r'\b\w{5}\b'
matches = df['Description'].str.findall(pattern)
print(matches)

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0    [Check]
1    [offer]
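When only the number of matches matters, str.count() accepts the same kind of pattern as str.findall(). The sketch below recreates the two sample descriptions and searches for four-letter words instead; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([
    'Check out our latest product release.',
    'Limited time offer on this product!',
])

# Words with exactly 4 characters
pattern = r'\b\w{4}\b'

four_letter_words = s.str.findall(pattern)  # list of matches per row
four_letter_count = s.str.count(pattern)    # number of matches per row

print(four_letter_words.tolist())  # [[], ['time', 'this']]
print(four_letter_count.tolist())  # [0, 2]
```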

Splitting

split_description = df['Description'].str.split(expand=True)
print(split_description)

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
         0     1      2       3        4         5
0    Check   out    our  latest  product  release.
1  Limited  time  offer      on     this  product!
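Splitting on every space produces one column per word, which is not always wanted. As a sketch, the n parameter limits the number of splits; here n=1 separates the first word from the rest of the text. The variable name parts is illustrative.

```python
import pandas as pd

s = pd.Series([
    'Check out our latest product release.',
    'Limited time offer on this product!',
])

# n=1 performs at most one split, giving exactly two columns
parts = s.str.split(n=1, expand=True)

print(parts[0].tolist())  # ['Check', 'Limited']
print(parts[1].tolist())
```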

Text extraction

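# expand=False returns a Series (one capture group), which maps cleanly onto a single column
df['Product_Type'] = df['Description'].str.extract(r'(product \w+)', expand=False)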
print(df['Product_Type'])

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0    product release
1                NaN
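str.extract() can also capture several pieces at once. The following sketch uses named groups, which become column names in the result. The group names Modifier and Following are made up for this illustration, and the second group is optional so that rows without a following word still match.

```python
import pandas as pd

s = pd.Series([
    'Check out our latest product release.',
    'Limited time offer on this product!',
])

# Capture the word before "product" and, if present, the word after it
pattern = r'(?P<Modifier>\w+) product(?: (?P<Following>\w+))?'
extracted = s.str.extract(pattern)

print(extracted)
```

A group that does not take part in the match is returned as a missing value, as in the second row of the Following column.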

Replace

df['Description_Replaced'] = df['Description'].str.replace('product', 'item')
print(df['Description_Replaced'])

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
0    Check out our latest item release.
1      Limited time offer on this item!
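str.replace() can treat the search text either literally or as a regular expression, controlled by the regex parameter. The sketch below sets the parameter explicitly in both cases; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([
    'Check out our latest product release.',
    'Limited time offer on this product!',
])

# regex=True: remove a trailing full stop or exclamation mark
no_trailing_punct = s.str.replace(r'[.!]$', '', regex=True)

# regex=False: '.' is a literal full stop, not "any character"
no_full_stops = s.str.replace('.', '', regex=False)

print(no_trailing_punct.tolist())
print(no_full_stops.tolist())
```

Setting regex explicitly avoids surprises with characters such as '.' that have a special meaning in regular expressions.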

Word frequency

word_frequency = df['Description'].str.split().explode().value_counts()
print(word_frequency)

# Original data
0    Check out our latest product release.
1      Limited time offer on this product!

# Modified
Description
Check       1
out         1
our         1
latest      1
product     1
release.    1
Limited     1
time        1
offer       1
on          1
this        1
product!    1
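In the output above, "product" and "product!" are counted as different words because of the punctuation, and "Check" would differ from "check" because of the case. One possible refinement, sketched below, is to lowercase the text and strip punctuation before counting. It reuses methods covered earlier in this tutorial; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Description': [
        'Check out our latest product release.',
        'Limited time offer on this product!',
    ]
})

# Normalise case, drop punctuation, then split into one word per row
words = (
    df['Description']
    .str.lower()
    .str.replace(r'[^\w\s]', '', regex=True)
    .str.split()
    .explode()
)

word_frequency = words.value_counts()
print(word_frequency.head(3))
```

After normalisation, "product" is counted twice and appears at the top of the result.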

This is original educational material on working with text data, created by aicorr.com.


by AICorr Team
