Développez l'apprentissage automatique prédictif avec Flink | Atelier du 18 déc. | S'inscrire

What is ETL?

ETL stands for “Extract, Transform, and Load.” It is a set of common processes for collecting, integrating, and distributing data to make it available for additional purposes, such as analytics, machine learning, reporting, or other business purposes. ETL enables an organization to carry out data-driven analysis and decision-making using operational data.

Built by the original creators of Apache Kafka, Confluent powers scalable, reliable, real-time streaming ETL across 120+ data sources.

Diagram illustrating extract transform load (ETL)

The ETL Process Explained

At its core, ETL is a three-step process: (1) Data is collected from one or more sources, then (2) is transformed to apply the required business logic to the data, including cleaning it, remodeling it, and joining it to other data. (3) Finally, it is stored in its new destination (such as a data warehouse or non-relational database). This process is essential for effective data integration and accurate data analysis.

(Note: A common subvariant of ETL is ELT (extract load transform). ETL performs the transformation step before loading the data, while ELT performs the transformation step after loading the data.)

ETL is not new. In fact, it’s evolved quite a bit from the 1970s and 1980s, where the process was sequential, data was more static, systems were monolithic, and reporting was only needed on a weekly or monthly basis.

Step 1: Extract Data

In the first step of ETL, the focus is on understanding where useful original data is stored, and what form or format it is in. This data can be in different formats and structures, including databases, files, applications, and real-time data streams. Then, processes are implemented to access it, through recurring batch processes, in real-time or triggered by specific events or actions.

This step involves:

  • Collecting data from structured sources like relational databases, and unstructured sources such as text files and log data.
  • Accessing data from transactional databases and data streams to ensure comprehensive data collection.
  • Ensuring that the extracted data includes all necessary information for further processing (data analysis and integration).

Step 2: Transform Data

In this step, the original data is cleaned, restructured, and aggregated to ensure it is in the correct format for storage in a data warehouse or other sources. This transformation process prepares the data for analysis, reporting, and use by various business applications.

Some of the key transformation processes include:

  • Deriving calculated values: Creating new data points based on raw data, such as calculating totals or averages.
  • Reordering or transposing the data: Changing the order or structure of data to fit the required format.
  • Adding metadata or associating key-value pairs: Enriching the data with additional information to provide more context.
  • Removing repetitive data or adding counts of occurrences of data: Identifying and eliminating duplicates to ensure data quality.
  • Encoding or decoding data: Converting data formats to meet the destination system's requirements.
  • Validating the data: Ensuring data accuracy and completeness through checks and validations.
  • Performing search and replace: Modifying data values to standardize or correct entries.
  • Changing the schema: Transforming data structures, such as converting text fields to numerical IDs or values.

Step 3: Load Data into Storage

In this step, the transformed data is stored in one or more locations where it can be accessed by applications, reporting tools, and other business processes. This storage can include unstructured object stores, text files, or complex data warehouses.

The loading process varies widely depending on the nature of the business requirements and the applications and users the data serves.

Some of the key considerations in this step include:

  • Destination variety: Data can be stored in various types of repositories, including relational databases, data warehouses, data lakes, and cloud storage solutions.
  • Access requirements: Data is accessible to the necessary applications and users for analytics, reporting, and other business processes.
  • Performance optimization: Loading data in a way that maximizes query performance and minimizes latency.
  • Data integrity: Ensuring the data remains accurate and consistent during the loading process.
  • Scalability: Implementing processes that can handle growing data volumes efficiently.

What are the benefits of ETL?

ETL offers several key benefits that support data integration, business intelligence, and broader data management strategies:

Data Integration

ETL enables organizations to gather data from multiple data sources, combine it, and then distribute it where it’s needed. This allows different types of data to work together and supports data-driven analysis and decision-making using operational data.

Business Intelligence

ETL plays a critical role in getting the right data to the right place for business intelligence. By consolidating data from various sources, ETL processes enable more comprehensive and accurate reporting, allowing businesses to make informed decisions.

Data Management

Common use cases for ETL include moving data from transactional or operational databases into a data warehouse or data lake for analysis and reporting. This process makes it easier and more efficient to analyze large amounts of data than would otherwise be possible in an operational database.

Data Cleansing and Transformation

ETL processes clean and transform data to make it easier to work with. This involves removing inconsistencies, standardizing formats, and ensuring data quality, which is essential for accurate analysis.

Types of ETL

Reverse ETL

Reverse ETL (rETL), also known as “Operational Analytics,” reverses the traditional ETL flow by sending cleaned and transformed data from a data warehouse into a business application. This allows organizations to access analytical data in an operational context, enabling better data-driven decisions.

ELT vs ETL

ELT (Extract, Load, and Transform) is a variation of ETL where data is extracted from one or more sources, loaded into a destination database, and then transformed in place using the destination’s processing capabilities. This process is used to quickly load and integrate data from multiple sources without the need for extensive preprocessing.

The main difference between ETL and ELT is the order of data transformation and loading steps. ETL transforms data before loading, while ELT performs the transformation after loading the data into the destination.

EtLT

EtLT is a sub-variation of ELT where data undergoes some basic transformations between the Extract and Load stages. However, major transformations supporting business logic or data modeling are reserved for the final transform stage. This approach balances initial data processing with the comprehensive transformations required for end-use.

Common ETL tools

Over the years, many ETL tools have been developed, each with its own unique strengths and perspectives. ETL tool developers include database vendors, cloud service providers, business intelligence/reporting vendors, and independent providers focusing exclusively on data integration.

Database ETL Tools

Database vendors often provide ETL tools to enable better data integration within their product suite. Examples of database ETL tools include:

  • Oracle Data Integrator
  • Microsoft SQLServer Integration Services
  • DB2 Web Query DataMigrator ETL Extension
  • IBM Data Integration

Cloud ETL Tools

Cloud Service Providers provide ETL capabilities within their offering to facilitate loading data into their services and integration of data amongst their services. Examples include:

  • AWS Glue: Serverless data integration in AWS
  • AWS Data Pipeline: Managed ETL service for AWS
  • Azure Data Factory: Fully managed serverless data integration for Azure
  • Google Cloud Dataflow: Unified serverless batch and stream processing engine for GCP

BI ETL Tools

Business Intelligence and Enterprise Reporting vendors provide ETL tools designed to provide data for business intelligence and reporting operations. Examples include:

  • BIRT Analytics Loader
  • SAP Business Objects Data Integrator
  • IBM Cognos Data Manager

ETL Data Integration Tools

Data integration is an industry segment of its own. A large subset of that is ETL. Examples of ETL data integration tools include:

  • Airbyte
  • Pentaho Data Integration
  • Talend Open Studio
  • Hevo Dat
  • Informatica
  • Alteryx
  • Fivetran
  • Matillion

Custom ETL Solutions

In our experience, a significant amount of ETL is performed without tools. Instead, custom ETL processes are often hand-written in code using a mix of SQL queries, database triggers, stored procedures, and scripting in languages such as Python, Perl, Java, server-side JavaScript, or Bash. As data volumes have expanded and data variety has increased, we’ve also seen custom ETL implemented with combinations of Apache Hadoop MapReduce or Apache Spark.

Challenges

Implementing a reliable, robust, and secure ETL process is challenging. This involves accessing, transforming, and storing data from multiple sources.

Here are some of the key challenges:

  • Permissions: Do your networks and systems have access and rights to the data?
  • Data freshness: Are you capturing real-time data, or stale data that's no longer of value? What is the ephemeral nature of data? Are you able to capture it before the data passes its lifetime?
  • Data quality and integrity: Do you have validation in place to ensure the extracted data is in an expected form? Combining data from multiple sources can be challenging due to differences in data formats, structures, and definitions. It can be difficult to identify and correct errors or inconsistencies in the data. Reasoning about the lineage of data (i.e., what the original source was) can be difficult once data sources are integrated.
  • Availability and scale: Is there enough storage and computing power in your staging area to keep up with the data? (The more data that needs to be transformed, the more computationally and storage intensive it can become.)
  • Filtering: Which data is important data and which can be ignored or discarded?
  • Security and privacy: ETL processes often involve sensitive or confidential data. Organizations want to ensure that data is protected and accessed only by authorized users, with increased complexity when dealing with large volumes of data or data from multiple sources.
  • Data governance and compliance: Data governance policies and regulations stipulate how data provenance, privacy, and security are to be maintained. Additional complexity arises from the integration of complex data sets or data subject to multiple disparate regulations.
  • Data transformation and cleansing: ETL processes often require significant data transformation and cleansing in order to prepare data for analysis or integration with other systems.
  • Integration with other systems: ETL processes often involve integrating data with other systems, such as data warehouses, analytics platforms, or business intelligence tools. This can be challenging, particularly when dealing with complex data structures or integration requirements.

Apache Kafka: A Complete Streaming ETL Platform

Apache Kafka® is a distributed streaming platform that is the core of modern enterprise architectures. Kafka is a flexible and scalable foundation for building ETL pipelines that connect distributed systems and event-driven applications, allowing businesses to process data in real-time and at scale:

  • Kafka Connectors run within the Kafka Connect framework and enable developers to extract data from one system and load it into another

  • Kafka Streams API gives applications the stream processing capabilities to transform data, one message or event at a time. These transformations can include joining multiple data sources, filtering data, and aggregating data over a period of time

  • Confluent Schema Registry provides data governance with Avro and other schemas

  • Confluent Control Center provides central data management

  • ksqlDB is a database purpose-built for stream processing applications, providing a familiar, lightweight syntax to capture, process, and serve queries using only SQL.

  • Stream Designer provides a graphical canvas to build pipelines powered by Apache Kafka®, with SQL extensibility.

In contrast to Batch ETL, Kafka enables ETL pipelines that are streaming, allowing real-time transformation and distribution of data at large scale.