During the extract phase of ETL, someone in the organization identifies the desired data sources and the rows, columns, and fields to be extracted from those sources. These sources likely include:
- Transactional databases hosted on-site or in the cloud
- Hosted business applications
Part of the planning for this stage should include estimating data volumes from each data source. You’ll make significantly different plans for every stage of your ETL process if your data sizes are 100 gigabytes versus 100 petabytes.
You must also extract data in a way that doesn’t have a negative impact on the source systems or response times.
Data extraction commonly happens in one of three ways:
- Update notification – The easiest way to extract data from a source system is to have that system notify you when a record has been changed. Most databases provide a mechanism for this so that they can support database replication, and many SaaS applications provide webhooks, which offer conceptually similar functionality.
- Incremental extraction – Some systems are unable to provide notification that an update has occurred, but they are able to identify which records have been modified, and provide an extract on those records. During subsequent ETL steps, the system needs to identify changes and propagate them down. One of the drawbacks of incremental extraction is that it may not be possible to detect deleted records in source data.
- Full extraction – Some systems are not able to identify which data has been changed at all, so reloading all data is the only way to get data out of the system. This extraction method requires you to keep a copy of the last extract so you can check which records are new. Because this approach involves high data transfer volumes, it’s recommended you use it only as a last resort, and only for small tables.
In the past, the ETL process was largely concerned with extracting data from transactional databases. The prevalence of SaaS products has changed this. Many companies today rely on a host of SaaS tools – Salesforce, Google Analytics, Google Adwords, Facebook Ads, Zendesk, HubSpot, and many more – to run their businesses.
The extraction process for nearly every SaaS product relies on integrating with its APIs. APIs introduce a few challenges to the ETL process:
- Integration is different for every application. Not every product provides a vanilla REST API. Some REST APIs are surprisingly convoluted, and some applications are still stuck on protocols like SOAP.
- Many APIs are not rigorously and accurately documented. Even APIs from reputable, developer-friendly companies sometimes have poor documentation.
- An API may have a large surface area. An enterprise application might have dozens of built-in resource endpoints, along with custom resources and fields, all of which add to the complexity.
- APIs are constantly changing (and breaking). Once you have data connections set up, they break. Once you’ve built more than a handful of connections, you’ll spend a significant amount of engineering time addressing breaking API changes.
For example, Facebook’s “move fast and break things” approach to development means the company frequently updates its reporting APIs – and it doesn't notify API users in advance, unless your team invests in building a close-enough relationship with the team building the API to get a “through the grapevine” heads-up.
Keep Learning about ETL Extraction
Extraction – This page introduces the logical data map (or lineage report) and its components, and goes through the methods of capturing changes in data.
Extraction in Data Warehouses – This chapter of Oracle’s Data Warehousing Guide details several extractions methods, including physical extraction methods, as well as change data capture.
Data Extraction Methods in Medium and Large Databases – This paper focuses on data extraction using four “lookup table” techniques. Working with a 10,000-record data set, these techniques are evaluated and compared against one another.