Time Series Analysis
Introduction to time series data
This is a time series analysis tutorial.
Time series data represents observations or measurements taken at different points in time, typically at regular intervals. Examples include stock prices, weather data, economic indicators, sensor readings, and social media activity over time. Pandas offers powerful tools for handling and analysing time series data efficiently. Let’s dive into the different techniques.
First, we create sample time series data with Pandas’ date_range() function. The periods parameter sets the number of timestamps to generate, and freq takes an offset alias (in our scenario, D stands for calendar day frequency).
Then, we build a Pandas DataFrame that uses those timestamps as its index.
import pandas as pd

# Sample timestamps: 10 consecutive calendar days
dates = pd.date_range('2024-01-01', periods=10, freq='D')
print(dates)

# DataFrame with the values 0-9 and the timestamps as index
data = pd.DataFrame({'value': range(10)}, index=dates)
print(data)
The output below shows the DatetimeIndex first, followed by the DataFrame.

# Time Series Data
DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08',
               '2024-01-09', '2024-01-10'],
              dtype='datetime64[ns]', freq='D')

# Dataframe
            value
2024-01-01      0
2024-01-02      1
2024-01-03      2
2024-01-04      3
2024-01-05      4
2024-01-06      5
2024-01-07      6
2024-01-08      7
2024-01-09      8
2024-01-10      9
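The freq parameter accepts other offset aliases as well. As a minimal sketch, the snippet below tries two aliases that the example above does not use: B (business days, which skips weekends) and W (weekly, anchored on Sundays by default). The variable names are illustrative only.

```python
import pandas as pd

# 'B' skips weekends: 2024-01-06 and 2024-01-07 fall on a Saturday and a Sunday
business_days = pd.date_range('2024-01-01', periods=6, freq='B')
print(business_days)

# 'W' generates one timestamp per week, on Sundays by default
weekly = pd.date_range('2024-01-01', periods=3, freq='W')
print(weekly)
```

Note that B only knows about weekends, so public holidays such as 1 January are still included.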
Now, let’s see how to access time series data. With a DatetimeIndex, .loc accepts date strings, so we can select a single date, a range of dates (both endpoints are included), or a whole month.
# Specific date
print(data.loc['2024-01-03'])

# Within a range of dates
print(data.loc['2024-01-03':'2024-01-07'])

# Specific month
print(data.loc['2024-01'])

value    2
Name: 2024-01-03 00:00:00, dtype: int64

            value
2024-01-03      2
2024-01-04      3
2024-01-05      4
2024-01-06      5
2024-01-07      6

            value
2024-01-01      0
2024-01-02      1
2024-01-03      2
2024-01-04      3
2024-01-05      4
2024-01-06      5
2024-01-07      6
2024-01-08      7
2024-01-09      8
2024-01-10      9
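Beyond date strings, the index itself exposes date attributes that can be used for filtering. The sketch below is an illustration rather than part of the original walkthrough: it rebuilds the same DataFrame, keeps only the weekend rows through the dayofweek attribute, and then uses an open-ended slice.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=10, freq='D')
data = pd.DataFrame({'value': range(10)}, index=dates)

# dayofweek numbers Monday as 0 and Sunday as 6, so >= 5 means a weekend
weekend = data[data.index.dayofweek >= 5]
print(weekend)

# Open-ended slice: every row from 2024-01-08 onwards
from_jan_8 = data.loc['2024-01-08':]
print(from_jan_8)
```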
Resampling and frequency conversion
Resampling involves changing the frequency of the time series observations. We can upsample (increase frequency) or downsample (decrease frequency) the data.
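# Resample to monthly frequency, summing the values in each month
# (pandas 2.2 and later prefer the alias 'ME' and warn that 'M' is deprecated)
monthly_data = data.resample('M').sum()
print(monthly_data)

# Resample to monthly frequency, averaging the values in each month
monthly_data = data.resample('M').mean()
print(monthly_data)

            value
2024-01-31     45

            value
2024-01-31    4.5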
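The example above only downsamples. For the opposite direction, here is a minimal upsampling sketch under illustrative assumptions: a three-day series (named daily here) is resampled to 12-hour steps, and the new, initially empty timestamps are filled either by forward fill or by linear interpolation.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=3, freq='D')
daily = pd.Series([0, 1, 2], index=dates)

# Upsample to 12-hour steps and repeat the last known value
filled = daily.resample('12h').ffill()
print(filled)

# Upsample to 12-hour steps, leave the new slots as NaN,
# then interpolate linearly between the known daily values
interpolated = daily.resample('12h').asfreq().interpolate()
print(interpolated)
```

Which fill strategy is appropriate depends on the data: forward fill suits values that stay constant until the next reading, while interpolation assumes a smooth change between readings.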
Time shifting and lagging
Time shifting and lagging move the values of a time series forward or backwards along its index. In Pandas, both are done with the shift() method.
Time shifting
This method involves shifting the entire time series by a certain offset (forward or backwards). First, we create a small sample time series.
import pandas as pd

# Sample time series data
dates = pd.date_range('2024-01-01', periods=5, freq='D')
data = pd.Series([10, 20, 30, 40, 50], index=dates)
print(data)

2024-01-01    10
2024-01-02    20
2024-01-03    30
2024-01-04    40
2024-01-05    50
Freq: D, dtype: int64
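Then, we shift the data by 1 period in each direction. A positive periods value moves the values forward (later in time), and a negative value moves them backwards. The vacated positions become NaN, which is why the dtype changes to float64.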
# Forward shift
shifted_forward = data.shift(periods=1)
print(shifted_forward)

# Backward shift
shifted_backward = data.shift(periods=-1)
print(shifted_backward)

2024-01-01     NaN
2024-01-02    10.0
2024-01-03    20.0
2024-01-04    30.0
2024-01-05    40.0
Freq: D, dtype: float64

2024-01-01    20.0
2024-01-02    30.0
2024-01-03    40.0
2024-01-04    50.0
2024-01-05     NaN
Freq: D, dtype: float64
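shift() also takes a freq argument. As a small sketch (the name index_shifted is illustrative), passing freq moves the index itself instead of the values, so no NaN values are introduced and the dtype stays the same.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=5, freq='D')
data = pd.Series([10, 20, 30, 40, 50], index=dates)

# With freq set, every timestamp moves one day later and the values stay attached
index_shifted = data.shift(periods=1, freq='D')
print(index_shifted)
```

This variant is handy when the timestamps are wrong by a fixed offset, whereas the plain shift is the usual choice for aligning a value with its neighbours.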
Lagging
Lagging pairs each timestamp with the value observed a fixed number of periods earlier, often to create lag features for time series analysis. In Pandas, this is a shift with a positive number of periods. Let’s create the sample data again.
import pandas as pd

# Sample time series data
dates = pd.date_range('2024-01-01', periods=5, freq='D')
data = pd.Series([10, 20, 30, 40, 50], index=dates)
print(data)

2024-01-01    10
2024-01-02    20
2024-01-03    30
2024-01-04    40
2024-01-05    50
Freq: D, dtype: int64
Now, we can lag the data by 1 period.
# Lagging
lagged_data = data.shift(periods=1)
print(lagged_data)

2024-01-01     NaN
2024-01-02    10.0
2024-01-03    20.0
2024-01-04    30.0
2024-01-05    40.0
Freq: D, dtype: float64
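To show how lagged values typically become features, here is a hedged sketch that places the series next to its 1-period and 2-period lags in a DataFrame. The column names (lag_1, lag_2, change) are hypothetical and chosen only for illustration.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=5, freq='D')
data = pd.Series([10, 20, 30, 40, 50], index=dates)

# One column per lag; each row sees the values from earlier periods
features = pd.DataFrame({
    'value': data,
    'lag_1': data.shift(1),
    'lag_2': data.shift(2),
})

# Period-over-period change, equivalent to data.diff()
features['change'] = features['value'] - features['lag_1']

# Drop the first rows, which have no complete history
features = features.dropna()
print(features)
```

Dropping the incomplete rows is one option among several; depending on the analysis, filling them may be preferable.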
Rolling window functions
Rolling window functions in Pandas allow you to perform calculations over a sliding window of data. These functions are particularly useful for tasks like moving averages, smoothing noisy data, or computing rolling statistics in time series analysis.
We cover the calculation of Simple Moving Average (SMA), Exponential Moving Average (EMA), and Rolling Statistics. Let’s dive into each one of them.
Simple moving average (SMA)
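The Simple Moving Average is the unweighted mean of the values in a fixed-size window. First, we create sample data, and then we calculate the SMA with a window of 3. The first two results are NaN because a full window of three values is not yet available.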
import pandas as pd

# Sample time series data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], index=dates)

# Simple moving average (window size of 3)
sma = data.rolling(window=3).mean()
print(sma)

2024-01-01     NaN
2024-01-02     NaN
2024-01-03    20.0
2024-01-04    30.0
2024-01-05    40.0
2024-01-06    50.0
2024-01-07    60.0
2024-01-08    70.0
2024-01-09    80.0
2024-01-10    90.0
Freq: D, dtype: float64
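The leading NaN values come from incomplete windows. As a sketch of two rolling() options that change this behaviour, min_periods=1 averages whatever values are available so far, and center=True aligns each result with the middle of its window instead of the end. The variable names are illustrative.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=10, freq='D')
data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], index=dates)

# Accept partial windows: the first result uses 1 value, the second uses 2
sma_partial = data.rolling(window=3, min_periods=1).mean()
print(sma_partial)

# Centered window: each result averages the previous, current, and next value
sma_centered = data.rolling(window=3, center=True).mean()
print(sma_centered)
```

A centered window uses future values, so it suits smoothing for visualisation but not features for forecasting.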
Exponential moving average (EMA)
The Exponential Moving Average gives more weight to recent data points when calculating the average. We create the same sample data and then calculate the EMA with a span of 3, which corresponds to a smoothing factor of alpha = 2 / (span + 1) = 0.5.
import pandas as pd

# Sample time series data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], index=dates)

# Exponential moving average (with a span of 3)
ema = data.ewm(span=3).mean()
print(ema)

2024-01-01    10.000000
2024-01-02    16.666667
2024-01-03    24.285714
2024-01-04    32.666667
2024-01-05    41.612903
2024-01-06    50.952381
2024-01-07    60.551181
2024-01-08    70.313725
2024-01-09    80.176125
2024-01-10    90.097752
Freq: D, dtype: float64
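To make the weighting concrete, the sketch below recomputes the EMA by hand. It assumes the default adjust=True behaviour of ewm(), where each result is a weighted average of all values so far with weights (1 - alpha)^k for a value k steps in the past. The loop and the name manual_ema are illustrative, not how Pandas computes it internally.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2024-01-01', periods=10, freq='D')
data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], index=dates)

span = 3
alpha = 2 / (span + 1)  # 0.5 for a span of 3
values = data.to_numpy(dtype=float)

manual = []
for t in range(len(values)):
    # The newest value gets weight 1, the one before it (1 - alpha), and so on
    weights = (1 - alpha) ** np.arange(t, -1, -1)
    manual.append(np.sum(weights * values[: t + 1]) / np.sum(weights))

manual_ema = pd.Series(manual, index=data.index)
print(manual_ema)

# Check the hand-rolled result against Pandas
print(np.allclose(manual_ema, data.ewm(span=span).mean()))
```

For the second day, for instance, this gives (20 + 0.5 * 10) / (1 + 0.5), which is about 16.67.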
Rolling statistics
Pandas supports various other rolling statistics, such as rolling sum, rolling max, and rolling min. Finally, we create sample data and then apply each of these rolling statistics to it.
import pandas as pd

# Sample time series data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], index=dates)

# Rolling sum (window size of 3)
rolling_sum = data.rolling(window=3).sum()
print(rolling_sum)

# Rolling maximum (window size of 3)
rolling_max = data.rolling(window=3).max()
print(rolling_max)

# Rolling minimum (window size of 3)
rolling_min = data.rolling(window=3).min()
print(rolling_min)

2024-01-01      NaN
2024-01-02      NaN
2024-01-03     60.0
2024-01-04     90.0
2024-01-05    120.0
2024-01-06    150.0
2024-01-07    180.0
2024-01-08    210.0
2024-01-09    240.0
2024-01-10    270.0
Freq: D, dtype: float64

2024-01-01      NaN
2024-01-02      NaN
2024-01-03     30.0
2024-01-04     40.0
2024-01-05     50.0
2024-01-06     60.0
2024-01-07     70.0
2024-01-08     80.0
2024-01-09     90.0
2024-01-10    100.0
Freq: D, dtype: float64

2024-01-01     NaN
2024-01-02     NaN
2024-01-03    10.0
2024-01-04    20.0
2024-01-05    30.0
2024-01-06    40.0
2024-01-07    50.0
2024-01-08    60.0
2024-01-09    70.0
2024-01-10    80.0
Freq: D, dtype: float64
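Several rolling statistics can also be computed in one pass. The following is a minimal sketch using agg() on the rolling object, with the rolling standard deviation added as a further example statistic. The name stats is illustrative.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=10, freq='D')
data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], index=dates)

# One column per statistic, all over the same 3-value window
stats = data.rolling(window=3).agg(['sum', 'max', 'min', 'std'])
print(stats)
```

The result is a DataFrame, which keeps the statistics aligned on the same index and makes them easy to compare side by side.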
This is an original time series analysis educational material created by aicorr.com.