The load stage of the ETL process depends largely on what you intend to do with the data once it’s loaded into the data warehouse. Uses could include:
- Layering a business intelligence or analytics tool on top of the warehouse
- Creating a tool for site search
- Building a machine learning algorithm to detect fraud
- Implementing a real-time alerting system
Regardless of your end goal, one of the key considerations during the load process is understanding the work you’re requiring of the target environment. Depending on your data volume, structure, target, and load type, you could negatively impact the host system when you load data.
For example, loading data into Amazon Redshift is best done infrequently in large batches. If you’re loading data into Redshift, you should avoid small, frequent batches or you’ll have angry analysts beating down your door when they notice that your jobs are consuming all of their cluster resources.
Bottom line: The load process needs to be specific to what you’re loading data into. We’re going to move forward with the assumption that you’re loading data into an analytics warehouse.
There are two primary methods to load data into a warehouse:
- Full load: entire data dump that takes place the first time a data source is loaded into the warehouse
- Incremental load: delta between target and source data is dumped at regular intervals. The last extract date is stored so that only records added after this date are loaded. Incremental loads come in two flavors that vary based on the volume of data you’re loading:
- Streaming incremental load – better for loading small data volumes
- Batch incremental load – better for loading large data volumes
|Full load||Incremental load|
|Rows sync||All rows in source data||New and updated records only|
|Time||More time||Less time|
|Difficulty||Low||High. ETL must be checked for new/updated row. Recovery from an issue is harder|
The initial full load is relatively straightforward. When you start taking on incremental loads, things get more complex. Here are three of the most common problem areas:
- Ordering: To handle massive scale with high availability, data pipelines are often distributed systems. This means that arriving data points can take different paths through the system, which means they can be processed in a different order than they were received. If data is being updated or deleted, processing in the wrong order will lead to bad data. Maintaining and auditing ordering is critical for keeping data accurate.
- Schema evolution: What happens to your existing data when a new property is added, or an existing property is changed? Some of these changes can be destructive or leave data in an inconsistent state. For example, what happens if your data warehouse starts receiving string values for a field that is expected to be an integer datatype?
- Monitorability: With data coming from a large number of sources, failures are inevitable. How long will it take you to catch them? Failure scenarios include:
- An API is down for maintenance
- API credentials expire
- API calls are returning successfully, but do not contain any data
- Network congestion prevents communication with an API
- The pipeline destination (e.g. a data warehouse) is offline
Any of these problems will likely result in data that is either incomplete or wrong. Recovering from these issues can be a massive headache.
Keep Learning about ETL Loading
Loading Data into the Warehouse – In an explanation of three types of loads (data already archived, data contained in existing applications, and the trapping of ongoing changes to the operational environment from the last time data was loaded into the data warehouse) this study concludes that “Easily the most complex and important of these types of loads is the loading of changes.”
Part 1 and part 2 of the results of Amazon Redshift database benchmarks – Speed is a huge consideration when evaluating the effectiveness of a load process. These two mini-studies analyze COPY performance with compressed files, and the COPY performance of CSV, JSON, and Avro.
ETL vs. ELT for data warehousing – “In the case of ETL, data is moved to an intermediate platform where the transformation rules are applied before passing the data along to the data warehouse. By contrast, ELT uses standard transfer mechanisms such as FTP to transfer the data directly to the data warehouse infrastructure.”
Methods of Incremental Loading in Data Warehouse – This tutorial explains the incremental, or delta, loading method, which loads only records that have not yet been read and loaded into the data warehouse.
How to Dramatically Decrease Data Warehouse Load Times with Memory-Optimized Tables – To highlight the pros of memory-optimized tables for loading, this article compares them with the older BULK INSERT command. The author also explains a trick he learned to execute T-SQL commands asynchronously, offering it as a valid alternative or complementary process.