The load stage of the ETL process depends largely on what you intend to do with the data once it is loaded into the host system. This could include:
- Layering a business intelligence or analytics tool on top of the warehouse
- Creating a tool for site search
- Building a machine learning algorithm to detect fraud
- Implementing a real-time alerting system
Regardless of your end goal, one of the key considerations during the load process is understanding the work you’re requiring of the target environment. Depending on your data volume, structure, target, and load type, there is the possibility of negatively impacting the host system.
For example, loading data into Amazon Redshift is best done infrequently in large batches. If you’re loading data into Redshift, you’ll want to avoid small, frequent batches or you’ll have 10 angry analysts beating down your door when they notice your jobs are constantly consuming all of their cluster resources.
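In practice, "large batches" usually means staging files in S3 and issuing a single `COPY` per batch rather than many small `INSERT`s. A minimal sketch of assembling such a statement is below; the table, bucket, and IAM role names are placeholders, not real resources:

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that loads one large batch of
    gzipped CSV files staged under an S3 prefix.

    All identifiers passed in are assumed to be trusted/validated --
    this sketch does no quoting or escaping.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "CSV GZIP;"
    )

# Accumulate records in S3, then issue one COPY per large batch instead
# of many small inserts that would tie up cluster resources.
sql = build_copy_statement(
    "analytics.events",                                # hypothetical table
    "s3://my-etl-bucket/events/2024-01-01/",           # hypothetical prefix
    "arn:aws:iam::123456789012:role/redshift-load",    # hypothetical role
)
```

You would then execute `sql` through whatever Redshift connection library you already use.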
Bottom line: the load process needs to be specific to what you’re loading data into. We’re going to move forward with the ETL discussion assuming you’re loading data into an analytics warehouse.
There are two primary methods to load data into a warehouse:
- Full load: entire data dump that takes place the first time a data source is loaded into the warehouse.
- Incremental load: the delta between source and target data is loaded at regular intervals. The last extract date is stored so that only records added or updated after this date are loaded. Incremental loads come in two flavors that will vary based on the volume of data you’re loading:
- Streaming incremental load -- better for smaller data volumes
- Batch incremental load -- better for loading large data volumes
| | Full load | Incremental load |
|---|---|---|
| Rows synced | All rows in source data | New and updated records only |
| Time | More time | Less time |
| Difficulty | Low | High. ETL must be checked for new/updated rows, and recovery from an issue is harder |
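The incremental approach hinges on persisting a high-water mark (the last extract date) between runs. Here is a minimal sketch; the row shape and `updated_at` field are assumptions for illustration:

```python
from datetime import datetime, timezone

def incremental_extract(rows, last_extract_at):
    """Return only rows changed since the stored high-water mark, plus
    the new mark to persist for the next run.

    `rows` is a hypothetical list of dicts carrying an 'updated_at'
    timestamp; a real pipeline would pull this from the source system.
    """
    delta = [r for r in rows if r["updated_at"] > last_extract_at]
    # Advance the mark only as far as the data we actually saw.
    new_mark = max((r["updated_at"] for r in delta), default=last_extract_at)
    return delta, new_mark

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
# Only row 2 falls after the stored mark, so only it is loaded.
delta, mark = incremental_extract(
    rows, datetime(2024, 1, 2, tzinfo=timezone.utc)
)
```

The full load is simply the degenerate case where the stored mark predates every row.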
The initial full load is relatively straightforward. Complexity is introduced to the ETL process when you start taking on incremental loads. Here are three of the most common problem areas:
- Ordering: To handle massive scale with high availability, data pipelines are often distributed systems. This means that arriving data points can take different paths through the system, which also means they can be processed in a different order than they were received. If data is being updated or deleted, processing in the wrong order will lead to bad data. Maintaining and auditing ordering is critical for keeping data accurate.
- Schema evolution: What happens to your existing data when a new property is added, or an existing property is changed? Some of these changes can be destructive or leave data in an inconsistent state. For example, what happens if your host source starts receiving string values for a field that is expected to be an integer datatype?
- Monitorability: With data coming from such a large number of sources, failures are inevitable. The question is, how long will it take you to catch them? Failure scenarios include:
- An API is down for maintenance
- API credentials expire
- API calls are returning successfully, but do not actually contain any data
- There’s network congestion preventing communication with an API
- The pipeline destination (e.g. a data warehouse) is offline
Any of these problems will likely result in bad data that is either incomplete or wrong. Recovering from these issues can be a massive headache.
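To make the ordering problem concrete: one common defense is to carry a per-record sequence number and apply last-write-wins, discarding stale events that arrive late. A sketch, with a hypothetical event shape:

```python
def apply_events(events):
    """Collapse possibly out-of-order change events into final row state
    using a per-record sequence number (last-write-wins).

    Event shape is hypothetical: {"id", "seq", "op", "data"}.
    """
    state, seen_seq = {}, {}
    for e in events:
        if e["seq"] <= seen_seq.get(e["id"], -1):
            continue  # a stale event arrived late; skip it
        seen_seq[e["id"]] = e["seq"]
        if e["op"] == "delete":
            state.pop(e["id"], None)
        else:
            state[e["id"]] = e["data"]
    return state

# The delete (seq 2) arrives before an older update (seq 1); the stale
# update must not resurrect the deleted row.
events = [
    {"id": "a", "seq": 0, "op": "upsert", "data": {"v": 1}},
    {"id": "a", "seq": 2, "op": "delete", "data": None},
    {"id": "a", "seq": 1, "op": "upsert", "data": {"v": 2}},
]
final = apply_events(events)
# final == {}  -- row "a" stays deleted
```

Without the sequence check, processing these events in arrival order would leave the deleted row in the warehouse.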
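For the schema-evolution case above (string values arriving for an integer field), one common pattern is to validate each value at load time and route bad records to a dead-letter queue rather than corrupting the column. A minimal sketch with illustrative names:

```python
def coerce_int(value, field, dead_letter):
    """Validate an incoming value against an expected integer column.
    Records that cannot be coerced are quarantined in `dead_letter`
    for later inspection (names here are illustrative, not a real API).
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        dead_letter.append({"field": field, "value": value})
        return None

bad = []
ok = coerce_int("42", "user_id", bad)      # numeric string: coerced to 42
rejected = coerce_int("n/a", "user_id", bad)  # quarantined, returns None
```

The dead-letter list gives you a place to alert on, which also helps with the monitorability problem below.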
Keep Learning about ETL Loading
Loading Data Into the Warehouse - In an explanation of the three types of loads (data already archived, data contained in existing applications, and the trapping of ongoing changes to the operational environment from the last time data was loaded into the data warehouse) this study concludes that “Easily the most complex and important of these types of loads is the loading of changes.”
Part 1 & part 2 of the Redshift Database Benchmarks - Speed is a huge consideration when evaluating the effectiveness of a load process. These two mini-studies analyze COPY performance with compressed files, and the COPY performance of CSV, JSON, and Avro.
ETL vs. ELT for data warehousing - “In the case of ETL, data is moved to an intermediate platform where the transformation rules are applied before passing the data along to the data warehouse. By contrast, ELT uses standard transfer mechanisms such as FTP to transfer the data directly to the data warehouse infrastructure.”
Methods of Incremental Loading in Data Warehouse - This tutorial explains the popular incremental, or delta, loading method, which only loads records that have not yet been read and loaded into the data warehouse.
How to Dramatically Decrease Data Warehouse Load Times with Memory-Optimized Tables - To highlight the pros of memory-optimized tables for loading, the older “BULK INSERT” command is evaluated for comparison. Additionally, the author explains a trick he learned to execute T-SQL commands asynchronously, offering it as a valid alternative or complementary process.