The most common mistake people make when building an ETL system--or arguably any technology project--is that they jump into writing code and buying technology solutions before thinking through the needs of their organization.
What sources do you need to extract data from? What warehousing technology are you using? What’s the analytical skill level within your organization? Why is this project important to your overarching company goals? These are the questions that determine your overall data architecture, a necessary outline to establish prior to starting your project.
Chris Lockhart describes the importance of architecture-first thinking like this:
To Lockhart, architecture is the vision and the planning toward that vision.
Here are the things you should be thinking about when designing your ETL architecture:
- Understand your organizational requirements
- Audit your data sources
- Determine your approach to data extraction
- Build your cleansing machinery
- Manage the ETL process
Understand your organizational requirements
There comes a time in every business's life when joining data across disparate data sets on a one-off basis is no longer tenable: you’re suddenly overloaded with spreadsheets, or you start to calculate how much time you’re spending on perpetually-breaking ETL scripts. Before you jump into a solution, it’s important to make sure you have a good understanding of why ETL is a priority right now for your organization.
- Data Sources: Which data sources does your team most want to work with? How much time are you willing to dedicate to building a new integration when the situation arises? For many startups and growing businesses, data sources are fairly standardized. You’ll likely be extracting data from transactional databases like MongoDB or MySQL or SaaS tools like Salesforce and Google Analytics.
- Usage: Where will the data be loaded, who will be using it, and how will they be using it? If you’re provisioning data for analytics, that’s going to look different than if you’re provisioning data to power your product like Uber or Airbnb.
- Latency: How often do your end users need fresh data? For systems alerting, you’ll need your data to be updated every second. If you’re delivering data for financial reporting, once a week might be enough. If your data scientists want to work on a data model, you may only need to deliver data once.
Answering these foundational questions will help you make the first critical decision about your ETL architecture: should you build it or buy it? If you’re extracting data from standard sources and using it primarily for business analytics, buying it is often the right choice. No need to reinvent the wheel. If you're Uber and you need to make realtime decisions based on the location data from hundreds of thousands of cars, you may need a custom solution.
Audit your data sources
This stage of the architecture process requires that you build a profile of the data sources your business uses. You’ll want to consider the production databases that power your business, as well as the disparate data sets that live in your SaaS tools. Here are some common areas:
|Production Databases||Sales & Marketing Sources||Customer Support Sources||Operational Sources|
It’s important during this process to understand what data sources are a top priority. For most companies, consistent financial data around revenue, MRR, unit economics, etc. will be the top priority. Identifying these priorities will provide you with the information needed to make tradeoffs around data completeness vs. speed of project completion.
Determine your approach to data extraction
There are two questions that will guide your approach to data extraction:
- What sources are you extracting data from?
- What methods of data extraction will you use?
The Data Extraction section of this website covers these points in detail.
Build your cleansing machinery
In the Transform section, we explore how and why improved data warehouse processing has impacted the transformation stage of ETL. Today, modern ETL systems leave the bulk of transformations in the analytics stage, and focus instead on loading minimally-processed data into the warehouse. This increases the flexibility of your ETL process and gives your analysts greater control over data modeling.
That said, there’s still some basic data cleansing that will need to happen during the ETL process. These commonly occurring scenarios need to be handled prior to loading data into your warehouse:
- Restructuring Data: Oftentimes, the way data is structured in a source data store makes it impossible or non-optimal to load it into another data store. A common example of this occurs when translating nested structures from a JSON API or MongoDB into a relational structure.
- Data typing: You’ll need to type the data entering your system and maintain that type (i.e., currency, date) as it travels through your ETL process.
- Data schema evolution: This is a consideration for internal systems, but even more so when dealing with external data sources like SaaS APIs. If a SaaS tool adds a new column to their data collection, your system needs to be prepared to accommodate the change, ideally without human intervention. New fields get added frequently.
Manage the ETL process
The final stage of this process is figuring out the functionality you’ll need to manage the ETL process. Data being delivered to your organization needs to be reliable; the processes you put in place here will ensure reliability, and build organizational trust around your data accuracy.
- Job Scheduling: How often will the ETL process run? Will it be triggered or run automatically? Will this vary by data source? For example, your CEO will likely want daily revenue numbers, but your customer support team probably only needs NPS data every week or two. Getting these schedules right will reduce load on your internal systems. Even in environments where aspects of your ETL pipeline are built for streaming, there will likely be other components that run in batch.
- Monitoring: If any of your systems fail, or your testing rules reveal a data inaccuracy, you’ll want to know about it. Your monitoring system escalates problems, alerting your team so they can take action immediately. You’ll likely want to integrate this with the other monitoring and alerting systems you use for other aspects of your technology infrastructure.
- Recovery: It’s not a question of if, but when, your ETL process will fail. You just need to be prepared for when it happens. Your data warehouse might go down, an extraction job might fail, a SaaS API temporarily go down or start sending you nonconforming data. Your architecture needs to plan for failure and have recovery mechanisms in place for when it happens.
- ETL Testing: In addition to recovery options, you want checks and balances in place to test data accuracy. Bilal Mahmood wrote a great piece outlining the seven most common data inaccuracies, and the rules to test for them.
- Security: Your security measures need to consider things like how to transfer data securely, but also what data gets transferred. For example, unless you’re PayPal (or a company like them) there’s likely no benefit to storing customer payments information in your analytics warehouse.
Keep learning about this topic
ETL Testing Tutorial - An incredibly in-depth resource that covers techniques, tester’s roles, backup recovery, and so much more. This is the ultimate guide to ETL testing.
ETL Architecture’s 34 Subsystems - The Kimball Group wrote the definitive text on ETL and data warehousing, The Data Warehouse Toolkit. While this isn’t all still relevant in the modern ETL environment, the 34 subsystems presents an excellent beginning point for developers looking to get the fundamentals right.
Kimball University: Six Key Decisions for ETL Architectures - Another great resource from Kimball. This article outlines six questions to ask before starting your project. Not too surprising: the first question on the list is buy vs. build.