ETL Architecture

The most common mistake people make when building an ETL system (or arguably any technology project) is jumping into buying technology and writing code before thinking through the needs of their organization.

Before you start building your ETL architecture, consider some fundamental questions. What sources do you need to extract data from? What warehousing technology are you using? What’s the analytical skill level within your organization? Why is this project important to your overarching company goals? The answers to these questions should help you determine your overall data architecture, a necessary outline to establish prior to starting your project.

Enterprise architect and consultant Chris Lockhart describes the importance of architecture-first thinking like this: 

 
A comprehensive architecture that is made up of things like mission, vision, goals, objectives, requirements, roles, activities, capabilities, services, technical components and yes, physical components is a holistic thing. It is something we bake, not something served up in piecemeal ingredients. We don’t eat flour and call it bread just because everyone around us happens to be an expert in flour. An architecture is not a single one of those ingredients. It is the result of an intentional effort to bring those parts together so that they positively impact the matter at hand. Talking about servers and middleware and network all day, no matter how appealing to some (including myself) will never go far in solving business problems. Frankly it will just make you look like the stereotype Dilbert character. Your organization will lose all credibility with the folks who pay the bills.
 

To Lockhart, architecture is the vision and the planning toward that vision. 

Here are five things you should do when designing your ETL architecture:

  • Understand your organizational requirements
  • Audit your data sources
  • Determine your approach to data extraction
  • Build your cleansing machinery
  • Manage the ETL process

Understand your organizational requirements

There comes a time in every business's life when joining data across disparate data sets on a one-off basis is no longer tenable: You’re suddenly overloaded with spreadsheets, or you start to calculate how much time you’re spending on perpetually breaking ETL scripts. Before you jump into a solution, make sure you understand why ETL is a priority right now for your organization.

  1. Data sources: Which data sources does your team most want to work with? How much time are you willing to dedicate to building a new integration when the situation arises? For many startups and growing businesses, data sources are fairly standardized. You’ll likely be extracting data from transactional databases such as MongoDB and MySQL and SaaS tools like Salesforce and Google Analytics.
  2. Usage: Where will the data be loaded, who will be using it, and how will they be using it? If you’re provisioning data for analytics, that’s going to look different than if you’re provisioning data to power your product, as businesses like Uber or Airbnb do. 
  3. Latency: How often do your end users need fresh data? For systems alerting, you need your data to be updated every second. If you’re delivering data for financial reporting, once a week might be enough. If your data scientists want to work on a data model, you may only need to deliver data once.
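The latency answers above can be captured directly as configuration. Here is a minimal sketch, with hypothetical source names and intervals, of how per-source freshness requirements might drive refresh decisions:

```python
from datetime import datetime, timedelta

# Hypothetical freshness requirements per source (names and intervals
# are illustrative, not prescriptive).
FRESHNESS = {
    "app_db": timedelta(minutes=5),     # near-real-time operational data
    "salesforce": timedelta(hours=24),  # daily revenue reporting
    "nps_survey": timedelta(weeks=1),   # weekly customer support metrics
}

def sources_due(last_run: dict, now: datetime) -> list:
    """Return the sources whose data is staler than its required freshness."""
    return [source for source, interval in FRESHNESS.items()
            if now - last_run.get(source, datetime.min) >= interval]
```

A source that has never run (missing from `last_run`) is always considered due, which is usually the behavior you want for a new integration.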

Answering these foundational questions will help you make the first critical decision about your ETL architecture: Should you build it or buy it? If you’re extracting data from standard sources and using it primarily for business analytics, buying it is often the right choice. There's no need to reinvent the wheel. On the other hand, if you're Uber and you need to make real-time decisions based on location data from hundreds of thousands of cars, you may need a custom solution.

Audit your data sources

The next stage of the architecture process requires that you build a profile of the data sources your business uses. You’ll want to consider the production databases that power your business, as well as the disparate data sets that live in your SaaS tools. Here are some common areas:

Production databases
- MongoDB
- MySQL
- PostgreSQL
- MariaDB
- Microsoft SQL Server
- Oracle

Sales and marketing sources
- Web Analytics: Google Analytics, Mixpanel
- Marketing Automation: HubSpot, Marketo
- Email Service Providers: MailChimp, Constant Contact
- Ad data: Google Adwords, Twitter Ads, Facebook Ads
- CRM: Salesforce, SugarCRM, Insightly

Customer support sources
- Help desks: Zendesk, Desk.com
- Customer chat: Intercom, Zendesk Chat

Operational sources
- Payments: Zuora, Recurly, Stripe
- Accounting: Quickbooks, Xero
- Shopping cart platforms: Shopify, Magento

It’s important during this process to understand which data sources are a top priority. For most companies, consistent financial data around things like revenue, MRR, and unit economics will come first. Identifying these priorities gives you the information you need to make tradeoffs between data completeness and speed of project completion.

Determine your approach to data extraction

Two questions should guide your approach to data extraction:

  1. What sources are you extracting data from?
  2. What methods of data extraction will you use?

The Extraction section of this website covers these points in detail.

Build your cleansing machinery

In the Transform section of this site, we explore how and why improved data warehouse processing has impacted the transformation stage of ETL. Modern ETL systems leave the bulk of transformations for the analytics stage, and focus instead on loading minimally processed data into the warehouse. This increases the flexibility of your ETL process and gives your analysts greater control over data modeling.
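As a rough illustration of this ELT-style approach, here is a sketch (using SQLite in place of a real warehouse, with a hypothetical schema) that loads raw JSON payloads untouched and leaves the modeling to a SQL view:

```python
import json
import sqlite3

# Landing table holds minimally processed data: the raw payload, as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (payload TEXT)")

records = [{"id": 1, "amount": 19.99}, {"id": 2, "amount": 5.00}]
conn.executemany("INSERT INTO raw_orders VALUES (?)",
                 [(json.dumps(r),) for r in records])

# The "transform" lives in the warehouse as a view, not in the pipeline,
# so analysts can change the model without re-extracting anything.
conn.execute("""
    CREATE VIEW orders AS
    SELECT json_extract(payload, '$.id') AS id,
           json_extract(payload, '$.amount') AS amount
    FROM raw_orders
""")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

If the analysts later decide they need a different model, they redefine the view; the raw data never has to be re-extracted or re-loaded.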

That said, you may need to do some basic data cleansing during the ETL process. Certain commonly occurring scenarios should be handled prior to loading data into your warehouse:

  1. Restructuring data: Often, the way data is structured in a source data store makes it difficult or inefficient to load into another data store. This is the case, for example, when translating nested structures from a JSON API or MongoDB into a relational structure. 
  2. Data typing: You need to type the data entering your system and maintain that type (currency, date, etc.) as it travels through your ETL process.  
  3. Data schema evolution: This is a consideration for internal systems, but it applies even more when you're dealing with external data sources like SaaS APIs. If a SaaS tool adds a new column to its data collection, your system needs to be prepared to accommodate the change, ideally without human intervention. As products and APIs evolve, new fields get added frequently.
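To make the first two points concrete, here is a small sketch, with a hypothetical record shape, of flattening a nested JSON document into relational-style columns and typing a date field before load:

```python
from datetime import date

# Hypothetical nested record, e.g. from a JSON API or a MongoDB document.
doc = {
    "id": "ord_1",
    "created": "2021-03-04",
    "customer": {"name": "Acme", "country": "US"},
}

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into column-style keys (customer.name, ...)."""
    flat = {}
    for key, value in record.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{col}."))
        else:
            flat[col] = value
    return flat

row = flatten(doc)
row["created"] = date.fromisoformat(row["created"])  # type the field before load
```

A real pipeline would also need a policy for arrays (split into a child table, or serialize), which is where most of the restructuring effort tends to go.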

Manage the ETL process

The final stage of designing an ETL architecture is figuring out the functionality you’ll need to manage the ETL process. Data being delivered to your organization needs to be reliably available and accurate; the processes you put in place here will ensure reliability and build organizational trust around your data accuracy.

  1. Job scheduling: How often should the ETL process run? Should it be triggered or run automatically? Should this vary by data source? For example, your CEO will likely want daily revenue numbers, but your customer support team probably needs NPS data only every week or two. Getting these schedules right will reduce load on your internal systems. Even in environments where aspects of your ETL pipeline are built for streaming, there will likely be components that run in batch.
  2. Monitoring: If any of your systems fail, or your testing rules reveal a data inaccuracy, you’ll want to know about it. Your monitoring system should escalate problems, alerting your team so they can take action immediately. You’ll likely want to integrate this with the other monitoring and alerting systems you use for other aspects of your technology infrastructure. 
  3. Recovery: It’s not a question of if, but when, your ETL process will fail. You need to be prepared for when it happens. Your data warehouse might go down, an extraction job might fail, a SaaS API might temporarily go down or start sending you nonconforming data. Your architecture needs to plan for failure and have recovery mechanisms in place for when it happens. 
  4. ETL testing: In addition to recovery options, you want checks and balances in place to test data accuracy. Bilal Mahmood wrote a great piece outlining the seven most common data inaccuracies and the rules to test for them. 
  5. Security: Your security measures need to consider how to transfer data securely, as well as what data gets transferred. For example, unless you’re PayPal (or a company like it) there’s likely no benefit to storing customer payment information in your analytics warehouse. 
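Scheduling, monitoring, and recovery often meet in a retry wrapper around each extraction job. Here is a minimal sketch (a hypothetical interface, not any real library) of retrying with exponential backoff and alerting on every failure:

```python
import time

def run_with_retry(job, attempts: int = 3, base_delay: float = 1.0,
                   alert=print, sleep=time.sleep):
    """Run an ETL job; on failure, back off and retry, alerting each time.

    `job` is any zero-argument callable; `alert` and `sleep` are injectable
    so the sketch stays testable. This is illustrative, not a real scheduler.
    """
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            alert(f"attempt {attempt} failed: {exc}")
            if attempt == attempts:
                raise  # recovery exhausted: escalate to the monitoring system
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

Wiring `alert` into the same paging system you use for the rest of your infrastructure keeps ETL failures from living in a silo.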

Keep Learning about ETL Architecture

ETL Testing Tutorial – An in-depth resource that covers techniques, testers’ roles, backup recovery, and much more. This is the ultimate guide to ETL testing.

ETL Architecture’s 34 Subsystems – The Kimball Group wrote the first definitive text on ETL and data warehousing, The Data Warehouse Toolkit. While not everything in the book, now in its third edition, remains relevant in the modern ETL environment, the 34 subsystems it presents make an excellent starting point for developers looking to get the fundamentals right.

Kimball University: Six Key Decisions for ETL Architectures – Another great resource from Kimball. This article outlines six questions to ask before starting your project. Not too surprising: the first question on the list is buy vs. build.