Best Practices in Azure Data Factory

Version 2

While you can find several “how to” articles on the web about Microsoft’s Azure Data Factory (ADF), there are virtually no “why to” articles. Here we’ll define some best practices to remember while working in Azure Data Factory version 2.

Article originally published April 2020

Azure Data Factory Best Practices

Since there aren’t many guiding resources on Azure Data Factory version 2, I wanted to share some “bigger-picture” notions about how to approach orchestration and data pipelines from a more architectural perspective. Let me preface this by noting that ADF version 2 differs from version 1 in its features, so the content below is aimed squarely at version 2.

Azure Data Factory Version 2

First things first: remember that good architecture practices always call for an appropriate separation of concerns between your solution layers. If you are working in ADF, it stands to reason that you are probably building a Modern Data Architecture solution in the Azure cloud. Therefore, your solution should consist of at least 3 separate components/layers:

  • Ingestion Layer

    This layer focuses purely on the intake of raw data from source systems. In a Modern Data Architecture, it typically stores data in a “raw zone” in a data lake store. Perform minimal cleansing or transformation here, if any. Further, there should be little (if any) consumption of raw-zone data by non-system users.

  • Transformation/Experimentation Layer

    This layer is where we massage data from the raw zone into a consumable form. It is typically more than just a data store: transformations run against the raw data, and the results are stored in the consumption layer. This is also where we fulfill experimentation and data science needs, since those data sets might not be shaped the same way as data prepared for end-user consumption.

  • Consumption Layer

    This is where we store ready-to-use data for user consumption. It may consist of a formalized data warehouse or mart structure, but it also might be stored in the data lake itself in a “cleansed” or “user” zone.
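One way to make the separation between zones concrete is a folder convention in the data lake. The sketch below is only an illustration: the zone names come from the descriptions above, but the `zone/source/dataset/date` layout, the `zone_path` helper, and the sample source and dataset names are all assumptions, not an ADF requirement.

```python
from datetime import date

# Zone names taken from the article ("raw" and "cleansed"); the folder
# layout below is a hypothetical convention, not something ADF mandates.
ZONES = ("raw", "cleansed")


def zone_path(zone: str, source: str, dataset: str, load_date: date) -> str:
    """Build a date-partitioned folder path for one zone of the lake."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{source}/{dataset}/{load_date:%Y/%m/%d}"


# Hypothetical source system "crm" and dataset "accounts".
print(zone_path("raw", "crm", "accounts", date(2020, 4, 1)))
print(zone_path("cleansed", "crm", "accounts", date(2020, 4, 1)))
```

Keeping the zone as the top-level folder makes it straightforward to grant system accounts access to the raw zone while limiting end users to the cleansed zone.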

Having laid out these concepts, note that Azure Data Factory version 2 doesn’t play much of a role on the consumption end of things. It is mostly intended as a key utility in the ingestion and transformation layers. That said, the specific role of ADF, and your approach to it, differs between those two layers.

While ingestion can be carried out solely by ADF itself (with some considerations), transformation is not as straightforward, and ADF is better relegated to the role of orchestrator.

Which brings an important point into focus…

ADF is primarily an orchestration tool, not so much a data transformation tool. Yes, it has capabilities in that regard, but typical uses delegate transformation logic to Databricks, Spark/Storm, or (less commonly these days) HDInsight.
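To illustrate the orchestrator role, here is a minimal sketch that builds a pipeline definition in the general shape of ADF's pipeline JSON: a Copy activity lands data in the raw zone, and a Databricks Notebook activity runs only after the copy succeeds. The activity, dataset, and linked service names, the notebook path, and the `raw_folder` parameter are all hypothetical, and the property layout should be checked against the current ADF schema before use.

```python
import json


def build_pipeline(raw_folder: str, notebook_path: str) -> dict:
    """Return an ADF-style pipeline definition: copy first, then transform."""
    # Ingestion: ADF moves the files itself, with no transformation.
    copy_to_raw = {
        "name": "CopyToRaw",  # hypothetical activity name
        "type": "Copy",
        "inputs": [{"referenceName": "SourceFiles", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "RawZoneFiles", "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": "BinarySource"},
            "sink": {"type": "BinarySink"},
        },
    }
    # Transformation: ADF only triggers the notebook; Databricks does the work.
    transform = {
        "name": "TransformInDatabricks",  # hypothetical activity name
        "type": "DatabricksNotebook",
        "dependsOn": [
            {"activity": "CopyToRaw", "dependencyConditions": ["Succeeded"]}
        ],
        "linkedServiceName": {
            "referenceName": "DatabricksWorkspace",  # hypothetical linked service
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            "baseParameters": {"raw_folder": raw_folder},
        },
    }
    return {
        "name": "IngestThenTransform",
        "properties": {"activities": [copy_to_raw, transform]},
    }


pipeline = build_pipeline("raw/crm/accounts", "/transform/clean_accounts")
print(json.dumps(pipeline, indent=2))
```

Note that the pipeline carries no transformation logic of its own: it sequences the work and passes the raw folder location to the notebook.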

What does this mean for your ELT/ETL architecture with ADF? It means considering each layer of the solution and each zone of the data architecture separately. Couple your ingestion subsystem loosely with your transformation subsystem, and consider the needs of each on its own terms. Don’t feel compelled to force ADF into a role it’s not suited for.
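One possible way to express that loose coupling is to keep ingestion and transformation in separate pipelines and let a thin parent pipeline chain them with Execute Pipeline activities. The sketch below assumes two hypothetical child pipelines, `IngestCrm` and `TransformCrm`, whose only shared contract is a `rawFolder` parameter; the helper name and parameter names are illustrative, and the JSON shape should be verified against the ADF documentation.

```python
import json
from typing import Optional


def execute_pipeline(name: str, child: str, parameters: dict,
                     depends_on: Optional[str] = None) -> dict:
    """Build an Execute Pipeline activity that invokes a child pipeline."""
    activity = {
        "name": name,
        "type": "ExecutePipeline",
        "typeProperties": {
            "pipeline": {"referenceName": child, "type": "PipelineReference"},
            "waitOnCompletion": True,
            "parameters": parameters,
        },
    }
    if depends_on is not None:
        activity["dependsOn"] = [
            {"activity": depends_on, "dependencyConditions": ["Succeeded"]}
        ]
    return activity


# The raw-zone folder is the only thing both subsystems need to agree on.
shared = {"rawFolder": "raw/crm/accounts"}  # hypothetical parameter and path

parent = {
    "name": "CrmOrchestrator",  # hypothetical pipeline name
    "properties": {
        "activities": [
            execute_pipeline("RunIngestion", "IngestCrm", shared),
            execute_pipeline("RunTransformation", "TransformCrm", shared,
                             depends_on="RunIngestion"),
        ]
    },
}

print(json.dumps(parent, indent=2))
```

With this arrangement, either child pipeline can be changed, rerun, or replaced independently as long as it honors the agreed folder location.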
