Data Science

Data Pipeline

What are pipelines?

A pipeline is a method in which a set of processes are connected and executed together. Just as gas or liquid flows through a physical tube-shaped pipeline, data flows through the connected stages of a computing pipeline, as the name suggests. This approach allows data to be transformed and moved automatically.
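The idea of connected, automatically executed stages can be sketched as a chain of ordinary functions. This is a minimal illustration, not any particular library's API; the stage names and sample data are invented for the example.

```python
# A minimal pipeline sketch: each stage is a function, and the output of
# one stage flows into the next, like liquid through a pipe.

def extract(raw: str) -> list[str]:
    """Collect raw data (here, a comma-separated string)."""
    return raw.split(",")

def transform(records: list[str]) -> list[int]:
    """Clean and convert each record."""
    return [int(r.strip()) for r in records]

def load(values: list[int]) -> dict:
    """Store the processed data (here, a summary dict)."""
    return {"count": len(values), "total": sum(values)}

def run_pipeline(raw: str) -> dict:
    # The processes are connected and executed together.
    return load(transform(extract(raw)))

result = run_pipeline(" 1, 2, 3 ")
print(result)  # {'count': 3, 'total': 6}
```

Chaining functions like this keeps each stage independent, so a stage can be swapped or tested on its own.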

The following is a basic representation of a pipeline.
Note: Diagrams of pipelines can differ depending on the type of industry and project.

[Diagram: example of a pipeline]

Types of Data Pipelines

Data pipelines fall into two categories: batch and streaming. The type of pipeline determines the underlying structure as well as the elemental processes of the data pipeline.

  • A batch data pipeline is the more traditional approach: it processes batches of data periodically. Data is gathered, loaded, and processed at set times (hourly, daily, etc.), so batch pipelines suit cases where data does not need to be processed and updated continuously. Batch data pipelines execute jobs at off-peak times and are more reliable and flexible. Common use cases: low-frequency reporting, accounting, sales, system updates, billing and payroll, and more.
  • A streaming data pipeline processes and updates data continuously and is often described as real-time or near real-time. Data is analysed and processed as soon as it is generated. Streaming pipelines are more time-sensitive and have lower latency than batch processing, but they are not as reliable as batch pipelines. Common use cases: critical reporting, fraud detection, sensor systems, cyber security, and more.

Structure of Data Pipelines

As the diagram above shows, data flows through a pipeline in distinct stages. This structure can be split into three main parts: absorbing, processing, and storing.

  1. Absorption refers to the sourcing or ingesting of data. This stage deals with collecting data from different sources.
  2. The second part, processing, handles transformative data processes (such as formatting).
  3. The final part of the data pipeline structure deals with storing the data.
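The three parts above can be sketched end to end. The hard-coded source rows and the in-memory `storage` list are illustrative stand-ins for a real data source and a real database or file.

```python
storage: list[dict] = []  # stand-in for a database or file

def absorb() -> list[str]:
    """Part 1: collect raw data from a source (here, hard-coded rows)."""
    return ["alice,30", "bob,25"]

def process(rows: list[str]) -> list[dict]:
    """Part 2: transform raw rows into a structured format."""
    return [{"name": name, "age": int(age)}
            for name, age in (row.split(",") for row in rows)]

def store(records: list[dict]) -> None:
    """Part 3: persist the processed records."""
    storage.extend(records)

store(process(absorb()))
print(storage)  # [{'name': 'alice', 'age': 30}, {'name': 'bob', 'age': 25}]
```

Each part maps directly onto one numbered stage above, which is why the three-function shape is a common starting point for pipeline designs.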
