Transformation refers to the cleansing and aggregation that may need to happen to data to prepare it for analysis. Architecturally speaking, there are two ways to approach ETL transformation:
- Multistage data transformation – This is the classic extract, transform, load process. Extracted data is moved to a staging area where transformations occur prior to loading the data into the warehouse.
- In-warehouse data transformation – In this approach, the process flow changes to something more like ELT. Data is extracted and loaded into the analytics warehouse, and transformations are done there.
In recent years, there has been a shift toward transforming data within the warehouse rather than transforming it beforehand, primarily driven by two factors:
- The increased performance and scalability of the modern analytics database
- The ability for in-database transformations to be written in SQL, the data manipulation language of choice for most analysts
Today, this approach has enough advantages that it has become the default; when projects decide to use the traditional model, this decision should come with real consideration of the costs and benefits.
ETL transformation types
Regardless of where in the process transformation takes place, it’s an important step in the analytic workflow. Transformations prepare the data for analysis. Here are some of the most common types:
- Basic transformations:
- Cleaning: Mapping NULL to 0 or "Male" to "M" and "Female" to "F," date format consistency, etc.
- Deduplication: Identifying and removing duplicate records
- Format revision: Character set conversion, unit of measurement conversion, date/time conversion, etc.
- Key restructuring: Establishing key relationships across tables
- Advanced transformations:
- Derivation: Applying business rules to your data that derive new calculated values from existing data – for example, creating a revenue metric that subtracts taxes
- Filtering: Selecting only certain rows and/or columns
- Joining: Linking data from multiple sources – for example, adding ad spend data across multiple platforms, such as Google Adwords and Facebook Ads
- Splitting: Splitting a single column into multiple columns
- Data validation: Simple or complex data validation – for example, if the first three columns in a row are empty then reject the row from processing
- Summarization: Values are summarized to obtain total figures which are calculated and stored at multiple levels as business metrics – for example, adding up all purchases a customer has made to build a customer lifetime value (CLV) metric
- Aggregation: Data elements are aggregated from multiple data sources and databases
- Integration: Give each unique data element one standard name with one standard definition. Data integration reconciles different data names and values for the same data element.
Keep Learning about ETL Transformation
Data Transformation – This simple explanatory page includes an easy-to-follow example that shows why it’s crucial that critical data be converted correctly when it’s transferred from an application to a database.
Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals – This book features an informative section on transformations that groups transformation techniques into “basic tasks” and “major transformation types.”
Loading and Transformation in Data Warehouses – This article goes into detail about the difference between multistage and pipelined data transformation, featuring diagrams that show the process from flat files to tables.