This stage refers to the cleansing and aggregation that may need to happen to data to prepare it for analysis. Architecturally speaking, there are two ways to approach ETL transformation:
- Multi-stage data transformation - this approach follows the classic extract, transform, load process. Extracted data is moved to a staging area where transformations occur prior to loading it into the warehouse.
- In-warehouse data transformation - in this approach, the process flow changes a bit to looking something more like ELT. Data is extracted and loaded into the analytics warehouse, and transformations are done there.
In recent years, there has been a shift towards transforming data within the warehouse rather than transforming it beforehand. This is primarily driven by two factors:
- The increased performance and scalability of the modern analytic database
- The ability for in-database transformations to be written in SQL, the data manipulation language of choice for most analysts.
Today, this approach has enough advantages that it should be considered the default; when projects decide to use the traditional model, this decision should come with real consideration as to the costs and benefits.
ETL Transformation Types
Regardless of where in the process transformation takes place, it’s an important step in the analytic workflow. Transformations prepare the data for analysis. Here are some of the most common types:
- Basic transformations:
- Cleaning: Mapping NULL to 0 or "Male" to "M" and "Female" to "F", date format consistency, etc.
- Deduplication: Identifying and removing duplicate records
- Format revision: Character set conversion, unit of measurement conversion, date/time conversion, etc.
- Key restructuring: Establishing key relationships across tables
- Advanced transformations:
- Derivation: Applying business rules to your data that derive new calculated values from existing data. For example, creating a revenue metric that subtracts taxes
- Filtering: Selecting only certain rows and/or columns
- Joining: Linking data from multiple sources. For example, adding ad spend data across multiple platforms like Google Adwords and Facebook Ads.
- Splitting: Splitting a single column into multiple columns
- Data validation: Simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing)
- Summarization: Values are summarized to obtain total figures which are calculated and stored at multiple levels as business metrics. For example, adding up all purchases a customer has made to build a Customer Lifetime Value (CLV) metric.
- Aggregation: Data elements are aggregated from multiple data sources and databases.
- Integration: Give each unique data element one standard name with one standard definition. Data integration reconciles different data names and values for the same data element.
Keep Learning about ETL Transformation
Data Transformation - This simple explanation also includes an easy-to-follow example that shows exactly why it’s crucial that critical data is converted correctly and isn’t lost when it’s transferred from an application to a database.
Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals - This book features a robust section on transformations that groups transformation techniques into “basic tasks” and “major transformation types.”
Loading and Transformation in Data Warehouses - This article goes into more detail about the difference between multi-stage and pipelined data transformation, featuring diagrams that show the process from flat files to tables.