The task of managing the data stored in a data warehouse has shifted from software engineers writing ETL (extract, then transform, then load) to data analysts writing ELT (extract, then load, and transform at the end). With the right ELT pipelines in place, the data and analytics team becomes far more scalable, because it can apply software engineering best practices to write maintainable data pipelines.

This post discusses some of the best practices for data modeling in this environment, compiled from our years of experience running data and analytics projects with clients building their data platforms.


1. Get oriented


To get oriented, let us define some important terms used throughout this post.

Table: An object in a database that has been written to disk. Querying a table reads data directly from disk.

View: An object in a database that has not been written to disk. Querying a view reads the underlying data from disk, applies the transformation logic encoded in the view, and then returns the results.

Relation: A term that covers both views and tables. It is a table-like object in a database that is queryable regardless of how it is materialized.

Data model: The highest level of abstraction. It includes relations as well as other intermediate objects that might not be queryable in the database at all (e.g., re-usable chunks of code that are never actually created in the database).

We use the term “transformations” for the data modeling work that creates relations in the data warehouse.

An example of a transformation that might be modeled in the data warehouse is de-normalizing data that comes from a transactional database so that analysts do not have to rewrite the join logic every time they want to perform an analysis. 

If you’re using a SQL-based transformation tool like dbt, you can create such a relation with a transformation definition like the one sketched below.
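As a minimal sketch (the source and column names here are hypothetical), such a dbt model might look like this:

```sql
-- models/denormalized_users.sql (hypothetical model; sources and columns are illustrative)
-- Joins account information onto users so analysts never have to rewrite this join.
select
    users.id            as user_id,
    users.email         as user_email,
    accounts.id         as account_id,
    accounts.plan       as account_plan,
    accounts.created_at as account_created_at
from {{ source('app', 'users') }} as users
left join {{ source('app', 'accounts') }} as accounts
    on users.account_id = accounts.id
```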

Once this relation is created in your warehouse, downstream users no longer have to remember where the account data is stored. All they have to do is query directly from the denormalized_users relation. 

Sure, this is a simple example. Most businesses require complicated transformations and queries to create objects that are core to measuring and running the business.

Importance of a transformation layer 

You may have felt the pain of a missing transformation layer at some point: 

  • You have to copy and paste SQL queries repeatedly to reuse logic that someone (who knows who) originally wrote. 
  • If the production application database schema changes, all the dashboards and everyone’s reporting breaks, leaving you to fix every broken piece and notify everyone who might be using the now-defunct snippet of code. 
  • Your queries are a tangled web of SQL code with complicated CTEs and joins. 
  • Making changes to an existing report is a complicated and error-prone process. 

These problems indicate that there is a missing abstraction layer. 

We use a transformation layer to achieve several goals: 

  • Codify “agreed upon” logic in a version-controlled system. 
  • Expedite more complex downstream analyses by providing simple(r) relations and conceptual objects that support higher-order reasoning. 
  • Make analyses less error-prone for non-analyst business users or data scientists who are less familiar with the intricacies of the data. 
Achieving these goals makes the analytics team much more scalable than it was before and frees up time to invest in sophisticated, value-generating analyses rather than just maintaining a fragile data pipeline.

2. Directed Acyclic Graph (DAG) in ELT pipelines


The transformation layer is a chain of transformations: raw data feeds into an initial transformation, and each subsequent transformation reads from transformations at a lower level of abstraction. 

This type of chain is referred to as a Directed Acyclic Graph, or simply a DAG. It is “Directed” because it has a direction (from raw data to cleaned data), “Acyclic” because transformations cannot be mutually dependent, and a “Graph” because it can be expressed as a network of nodes and edges. 

In the real world, the data DAG is made up of many transformations touching hundreds of upstream source tables. There is plenty of great tooling to help manage this complexity and develop the system efficiently. dbt has been a game changer here, as it allows complex DAGs to be constructed and managed using pure SQL plus Jinja templating.
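As a rough illustration (the model and column names are made up), dbt builds the DAG’s edges from `ref()` calls: the downstream model below depends on the upstream one simply by referencing it.

```sql
-- models/staging/stg_orders.sql (upstream: light cleaning of a raw source table)
select
    id         as order_id,
    user_id,
    created_at as ordered_at,
    amount
from {{ source('app', 'orders') }}

-- models/marts/user_order_summary.sql (downstream: reads from the staging layer)
-- The ref() call tells dbt this model depends on stg_orders, which is how
-- the edges of the DAG are constructed.
select
    user_id,
    count(*)    as order_count,
    sum(amount) as lifetime_revenue
from {{ ref('stg_orders') }}
group by user_id
```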

Transformation principles 

To make this system deliver value, there are a number of principles we recommend following when developing it. 

  • Use schemas to group transformations by type 

Leverage schemas to namespace your transformations and distinguish them by type. Namespacing directs end users to where they can find the relation they are looking for. 

Namespacing by business unit is common, but it can also be done by abstraction level or by data source. 

For example, you may want to separate the transformations that do LTV projection and marketing attribution from those that do cube-like pre-aggregations, and from those that do more straightforward data cleaning and pre-processing. 

Namespacing helps developers organize the codebase so they know where to look for the logic behind a given topic or abstraction. 
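For example (the schema and model names below are purely illustrative), dbt lets you set the target schema per model, so a marketing-attribution model can land in a marketing-specific schema:

```sql
-- models/marts/marketing/attribution_touches.sql (hypothetical model)
-- The schema config namespaces this relation under a marketing schema, so
-- end users and developers know where to look for marketing logic.
{{ config(schema='marketing', materialized='table') }}

select
    touch_id,
    user_id,
    channel,
    touched_at
from {{ ref('stg_marketing_touches') }}
```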

  • Follow the grain 

By the “grain” of a relation, we mean that relation’s fundamental unit of analysis. A users relation has the grain of a user if each row represents a single user. 

In fully normalized schemas (like an application database) it is easy to understand what the grain of every table is. However, once we move into the world of more complex relations in a data warehouse, it becomes particularly important to keep track of the grain of each relation. 
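A simple way to keep yourself honest about grain is to assert it directly. Assuming a hypothetical denormalized_users relation whose grain should be one row per user, a check like the following should return zero rows:

```sql
-- Returns rows only if the user grain is violated (i.e., duplicate user_ids exist).
select
    user_id,
    count(*) as row_count
from denormalized_users
group by user_id
having count(*) > 1
```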

  • Use schemas to control access to different relations 

You can use different schemas to provide more fine-grained access control to the relations in the warehouse. A common use case is separating sensitive data from non-sensitive data (e.g., isolating PII in its own schema) and using schemas to control access to the sensitive data. 
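A sketch of what this could look like with schema-level grants (the role and schema names are made up, and exact syntax varies by warehouse; this follows Snowflake-style grants):

```sql
-- Analysts can read the non-sensitive schema...
grant usage on schema analytics to role analyst;
grant select on all tables in schema analytics to role analyst;

-- ...but only a restricted role can read the schema that holds PII.
grant usage on schema analytics_pii to role pii_reader;
grant select on all tables in schema analytics_pii to role pii_reader;
```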

  • Use schemas to define your warehouse’s public API 

Only provide query access for BI tools and analysts to the most downstream schemas. 

This way, any changes made only to upstream “intermediate” relations will not impact downstream users. This is great for handling column-name changes. 

This is also a way of defining the difference between the “public API” of the data warehouse that is exposed to end users and the intermediate or purely internal relations that are subject to change without notice. 
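For example (the names below are hypothetical), if the application team renames a column in a source table, the rename can be absorbed in an intermediate relation so that nothing exposed to end users changes:

```sql
-- models/staging/stg_users.sql
-- The application renamed "signup_date" to "registered_at"; aliasing it back here
-- means every relation and dashboard built on the public schema keeps working.
select
    id            as user_id,
    registered_at as signup_date
from {{ source('app', 'users') }}
```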

  • Know what belongs in the presentation layer 

It can be tricky to decide what belongs in a BI tool as opposed to what belongs in the database. We do not have any fixed rules around this, but there are some general guidelines to follow. 

  • Presentation layer (the BI tool): 

– Simple joins to combine relations with different grains (e.g., joining users to orders) 

– Simple translations purely for presentation, for example: 

(a) Timezone shifting 

(b) Mapping geographic states onto geographic regions 

(c) Title-casing strings of product names

  • Data warehouse: 

– Any logic that needs to be re-used across multiple different reports or relations 

– Anything that needs to be used by multiple different downstream consumers (e.g., the BI tool, data scientists and an ML pipeline all need access to a consistent field)
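As a hypothetical illustration of that last point, a metric like net revenue should be defined once in the warehouse so that the BI tool, data scientists and the ML pipeline all read the same number:

```sql
-- models/marts/order_financials.sql (hypothetical model and columns)
-- Net revenue is derived exactly once here; downstream consumers select this
-- column instead of re-deriving it with their own arithmetic.
select
    order_id,
    user_id,
    gross_amount
        - coalesce(discount_amount, 0)
        - coalesce(refund_amount, 0) as net_revenue
from {{ ref('stg_orders') }}
```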


3. Code Smells


It is useful to cover some of the code smells to watch out for when reviewing ELT pipelines. 

These are not fixed rules, but rather things to keep in mind. It is okay to break them when you have a very good reason to. 

WET code 

Drying up WET (“write everything twice”) code improves maintainability. A good example is date logic: instead of copying and pasting the incantation for finding the first day of the week, factor it out into one canonical calendar relation that you can join to anywhere you need the first-day-of-week value. 
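A minimal sketch of that canonical calendar relation (the date-spine model it reads from is hypothetical; packages like dbt_utils are often used to generate one):

```sql
-- models/marts/calendar.sql
-- One canonical calendar: join to it on the date instead of repeating the
-- first-day-of-week incantation in every query.
select
    date_day,
    date_trunc('week', date_day)  as first_day_of_week,
    date_trunc('month', date_day) as first_day_of_month
from {{ ref('date_spine') }}
```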

Layer violations 

If you’ve organized your data warehouse as recommended here, you’ll begin to notice that your warehouse and your DAG form “layers”. The earliest, most “raw” relations sit at the top and the most “clean” relations sit at the bottom. 

Watch out for relations at the downstream end of your DAG that reach deep into the upstream layers (e.g., via a join). Ideally, your BI tool should only touch the bottom-most layer of the DAG. 

If you find your BI tool reaching deep into the raw and untransformed layers of the data warehouse, it means that your transformations are not modeling all the concepts that are important to your business and you need to do a better job of passing those through to the downstream layers. 

Too many CTEs 

An easy smell to spot is a transformation with more than three CTEs in it. These long, complicated queries are difficult to maintain and debug. 

Factoring out some of these transformations into their own relations will make your code easier to read and, more importantly, easier to test. 

Each relation should behave like a function in a program: you compose smaller, well-named abstractions rather than jamming all the logic into one big function. Treat your warehouse relations the same way, as the sketch below shows. 
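A rough sketch of the refactor (model names are illustrative): an intermediate step that used to be a CTE buried in one large query becomes its own relation, which the downstream model simply composes via `ref()`:

```sql
-- models/intermediate/int_user_first_order.sql
-- Formerly a CTE inside a much larger model; now a small, testable relation.
select
    user_id,
    min(ordered_at) as first_ordered_at
from {{ ref('stg_orders') }}
group by user_id

-- models/marts/user_summary.sql
-- Composes smaller relations instead of re-deriving them in a pile of CTEs.
select
    u.user_id,
    u.user_email,
    f.first_ordered_at
from {{ ref('denormalized_users') }} as u
left join {{ ref('int_user_first_order') }} as f
    on u.user_id = f.user_id
```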

GROUP BY everything 

Always look out for SQL that frequently groups by lots of different columns. This often indicates that the code is not obeying a consistent grain and is instead “dragging around” a lot of columns simply because they might be useful in the future. 

Ensure that your code consistently groups by the grain of the relation. If the grain is a user, you should be grouping by user_id only. If you find yourself grouping by user_id, user_email, user_dob, etc., it is a sign that the code could be refactored into separate relations or CTEs that operate at the correct grain, joining everything together at the end, as the sketch below shows.
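A hypothetical before-and-after: aggregate at the user grain first, then join the descriptive attributes back on, instead of dragging them through the GROUP BY:

```sql
-- Smell: group by user_id, user_email, user_dob just to carry the columns along.
-- Refactor: aggregate at the grain of the relation, then join attributes back on.
with orders_per_user as (
    select
        user_id,
        count(*) as order_count
    from {{ ref('stg_orders') }}
    group by user_id
)
select
    u.user_id,
    u.user_email,
    u.user_dob,
    o.order_count
from {{ ref('denormalized_users') }} as u
left join orders_per_user as o
    on u.user_id = o.user_id
```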

Final thoughts on ELT pipelines

These best practices are presented here as suggestions rather than hard and fast rules. If you understand the conceptual ideas behind them, you will be able to write readable, maintainable and extensible ELT pipeline code that improves your team’s velocity now and in the future.

