ETL Extract

During the extract phase of ETL, the desired data sources are identified and the data is extracted from those data sources. These sources will likely include:

  1. Transactional databases hosted on-site or in the cloud
  2. Hosted business applications

Part of the planning for this stage should include capacity planning for the data volumes coming from each data source. Your plans for every stage of the ETL process will look significantly different if you're dealing with 100 GB of data versus 100 PB.

It’s also of paramount importance to extract data in a way that doesn’t negatively affect the source systems or their response times.

Data Extraction

Data extraction most commonly happens in one of these three ways:

  1. Update notification - the easiest way to extract data from a source system is to have that system notify you when a record has changed. Most databases provide a mechanism for this so that they can support database replication, and many SaaS business applications provide webhooks that offer conceptually similar functionality (see the webhook sketch after this list).
  2. Incremental extraction - some systems are unable to provide notification that an update has occurred, but they can identify which records have been modified and provide an extract of just those records (see the watermark sketch after this list). During subsequent ETL steps, the system needs to identify these changes and propagate them downstream. One of the drawbacks of incremental extraction is that it may not be possible to detect deleted records in the source data.
  3. Full extraction - some systems cannot identify which data has changed at all, so reloading all the data is the only way to get it out of the system. This extraction method requires you to keep a copy of the last extract in the same format so you can compare it against the new extract and work out what has changed (see the diff sketch after this list). Because of the data transfer volumes this approach requires, it’s recommended only as a last resort, and only for small tables.
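
To make the update-notification pattern concrete, here is a minimal sketch of a webhook receiver, assuming a hypothetical SaaS source that POSTs a JSON change event whenever a record is created, updated, or deleted. Flask and the payload field names are illustrative assumptions, not any particular vendor’s API:

    # Minimal webhook receiver for update notifications (hypothetical SaaS source).
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/webhooks/records", methods=["POST"])
    def receive_change_event():
        event = request.get_json(force=True)
        # A typical payload might carry the record id, the change type, and a timestamp.
        record_id = event.get("record_id")
        change_type = event.get("change_type")   # e.g. "created", "updated", "deleted"
        # Hand the event off to a queue or staging table; here we just log it.
        app.logger.info("Change event: %s %s", change_type, record_id)
        return jsonify({"status": "accepted"}), 202

    if __name__ == "__main__":
        app.run(port=8080)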
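
For incremental extraction, a common approach is to track a high-water mark such as an updated_at timestamp. The sketch below assumes a hypothetical orders table with an updated_at column and keeps the watermark in a local file between runs; as noted above, deleted rows will never appear in the result:

    # Incremental extraction sketch using a timestamp watermark.
    import sqlite3
    from pathlib import Path

    WATERMARK_FILE = Path("orders.watermark")

    def read_watermark() -> str:
        # Fall back to the beginning of time on the first run.
        if WATERMARK_FILE.exists():
            return WATERMARK_FILE.read_text().strip()
        return "1970-01-01 00:00:00"

    def extract_changed_orders(db_path: str = "source.db"):
        watermark = read_watermark()
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT id, customer_id, total, updated_at "
            "FROM orders WHERE updated_at > ? ORDER BY updated_at",
            (watermark,),
        ).fetchall()
        conn.close()
        if rows:
            # Persist the newest timestamp seen so the next run picks up where we left off.
            WATERMARK_FILE.write_text(rows[-1][3])
        return rows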
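
For full extraction, change detection happens by diffing the new extract against the copy kept from the previous run. The sketch below assumes hypothetical CSV extracts keyed by an id column; unlike incremental extraction, this approach does surface deleted records:

    # Full-extraction diff sketch: compare the current extract against the
    # previous one to find new, changed, and deleted records.
    import csv

    def load_extract(path: str) -> dict:
        with open(path, newline="") as f:
            return {row["id"]: row for row in csv.DictReader(f)}

    def diff_extracts(previous_path: str, current_path: str):
        previous = load_extract(previous_path)
        current = load_extract(current_path)
        new_ids = current.keys() - previous.keys()
        deleted_ids = previous.keys() - current.keys()
        changed_ids = {
            key for key in current.keys() & previous.keys()
            if current[key] != previous[key]
        }
        return new_ids, changed_ids, deleted_ids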

In the past, the ETL process was largely concerned with extracting data from transactional databases. The rise of SaaS products is changing this. Many companies today rely on a host of SaaS tools – Salesforce, Google Analytics, Google AdWords, QuickBooks, Zuora, and many more – to run their business.

The extraction process for nearly every SaaS product relies on integrating with its API. APIs introduce a few unique challenges to the ETL process:

  • Every integration is different from the ones you’ve built before. Not every product provides a vanilla REST API. Some REST APIs are surprisingly convoluted, and some products are still stuck on older protocols like SOAP.
  • The API is not rigorously and accurately documented. Even APIs from reputable, developer-friendly companies sometimes have poor documentation.
  • The API has a large surface area. For example, a large CRM like Salesforce or an ERP system might have dozens of built-in resource endpoints along with custom resources and fields, all of which add to the complexity.
  • APIs are constantly changing (and breaking). Once you have the data connections set up, they break. Once you have built more than a handful of connections, you’ll spend a significant amount of engineering time addressing breaking API changes.
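
As an illustration of what API-based extraction often looks like in practice, the sketch below pulls paginated records from a hypothetical SaaS endpoint and backs off when the API signals rate limiting. The URL, parameter names, and cursor field are assumptions for illustration; real APIs vary widely in pagination and rate-limit behavior:

    # Paginated pull from a hypothetical SaaS REST API, with basic rate-limit handling.
    import time
    import requests

    API_URL = "https://api.example-saas.com/v1/contacts"   # hypothetical endpoint

    def extract_contacts(api_key: str):
        headers = {"Authorization": f"Bearer {api_key}"}
        cursor = None
        records = []
        while True:
            params = {"limit": 100}
            if cursor:
                params["cursor"] = cursor
            response = requests.get(API_URL, headers=headers, params=params, timeout=30)
            if response.status_code == 429:
                # Back off when the API signals rate limiting, then retry the same page.
                time.sleep(int(response.headers.get("Retry-After", "5")))
                continue
            response.raise_for_status()
            payload = response.json()
            records.extend(payload["results"])
            cursor = payload.get("next_cursor")
            if not cursor:
                break
        return records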

For example, Facebook’s “move fast and break things” approach to development means frequent updates to their reporting APIs. It’s not like you’ll get notified in advance either, unless your team invests in building a close enough relationship with the team building the API to get a “through the grapevine” heads-up.

Keep Learning about ETL Extraction

Extraction - This straightforward article explains the logical data map (or lineage report) and its components, and walks through methods of capturing changes in data.

Extraction in Data Warehouses - This chapter of Oracle’s Data Warehousing guide details each of the extraction methods, including physical extraction methods, as well as Change Data Capture.

Data Extraction Methods in Medium and Large Databases - This paper focuses on data extraction using four “lookup table” techniques, which are evaluated and compared against one another on a 10,000-record data set.