Streamlining Data Transformation with Metadata-driven ETL in Azure

September 20, 2023

Poreddy Ashok

Technical Architect, Analytics

Modern data platforms are built to process data dynamically by creating metadata-driven data pipelines. In addition to providing more scalability, a metadata-driven framework speeds up development and offers better maintainability, visibility, and reusability. Such a framework enables users to process thousands of tables and apply various processing steps without designing every data flow manually.

Metadata-driven ETL revolves around using metadata, which is data about data, to drive the ETL process. Metadata provides information about the structure, format, and semantics of the data, enabling automation and standardization of the ETL process. Instead of hard-coding every aspect of the data integration and transformation process, metadata-driven ETL allows developers and data engineers to define the ETL logic, rules, and transformations in a metadata repository. This repository is a centralized control system that drives the ETL process based on the metadata definitions.

When the data processing pipeline is designed with a metadata-driven approach, data lineage won't appear out of the box, since all the processing logic sits in a metadata repository. This problem can be addressed by hooking up a data lineage registration component such as Microsoft Azure Purview.

In this blog, we will learn more about metadata-driven ETL on the Microsoft Azure cloud platform and explore a metadata-driven approach to developing a data platform for SQL Server-based data stores.

Setting Up a Metadata Repository

While designing a metadata-driven processing framework, users can set up a database for metadata configuration. A metadata-driven process differs from building transformations directly in an ETL tool: instead, users store the specifications in an external database, which feeds that information to an engine such as Azure Data Factory (ADF) or Spark.

We will use a SQL Server database as the metadata repository for this blog post. This initial framework consists of a few tables, views, and stored procedures for defining the components used in the data platform. The source and destination are SQL Server databases – AdventureWorksDW2019 is used as the source, and AdventureWorksUAT as the destination database. The metadata repository also lives in the destination database. Users can find the code for the tables, views, and stored procedures in this Git repository; a minimal schema sketch for two of the tables follows the list below.

Tables

  • ETL Systems: All data sources and destinations.
  • ETL Data Period: Volume of live data and historical data based on a date.
  • ETL Tables: Tables to be loaded from a source.
  • ETL Columns: Columns of each table to be loaded.
  • ETL Pipelines: Placeholder for the data pipelines, including name and description.
  • ETL Log Tables: Used for storing logging results for the tables' load status.
  • ETL Transformations: References to the source, destination, columns, etc.
  • ETL Aggregations: Aggregations on numeric columns grouped by subject-oriented dimension keys.
  • ETL Log Pipelines: Used for storing logging results for pipeline execution status.
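
To make this structure concrete, below is a minimal T-SQL sketch of two of these repository tables. The column names here are illustrative assumptions for this blog and may differ from the definitions in the Git repository.

    -- Minimal sketch of two repository tables; column names are illustrative assumptions.
    CREATE TABLE [dbo].[ETL Systems]
    (
        SystemId      INT IDENTITY(1,1) PRIMARY KEY,
        SystemName    NVARCHAR(100) NOT NULL,   -- e.g. AdventureWorksDW2019
        SystemType    NVARCHAR(20)  NOT NULL,   -- 'Source' or 'Destination'
        ServerName    NVARCHAR(200) NOT NULL,
        DatabaseName  NVARCHAR(128) NOT NULL,
        UserName      NVARCHAR(128) NULL,
        [Password]    NVARCHAR(256) NULL        -- ideally resolved from Azure Key Vault
    );

    CREATE TABLE [dbo].[ETL Tables]
    (
        TableId     INT IDENTITY(1,1) PRIMARY KEY,
        SystemId    INT NOT NULL REFERENCES [dbo].[ETL Systems](SystemId),
        SchemaName  SYSNAME NOT NULL,
        TableName   SYSNAME NOT NULL,
        [Enable]    BIT NOT NULL DEFAULT (1)    -- 1 = include the table in the load
    );

The remaining tables ([ETL Columns], [ETL Log Tables], and so on) follow the same pattern, each keyed back to the system, table, or pipeline it describes.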

In a real-world, comprehensive framework, there are additional metadata tables for validations, transformations, scheduling, triggering pipelines, data-usage security, etc. These are not in the scope of this blog; we will cover them in the next blog post.

Views

  • vLogSuccessTables: Displays table load success details from the [dbo].[ETL Log Tables] table (see the sketch after this list).
  • vLogFailureTables: Displays table load failure details from the [dbo].[ETL Log Tables] table.
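
As a hedged illustration, the success view can be as simple as the sketch below; it assumes the [dbo].[ETL Log Tables] columns described later (Status, TableName, StartTime, EndTime, Duration, RecordCount).

    -- Sketch of vLogSuccessTables; the failure view is identical except for Status = 'Failure'.
    CREATE OR ALTER VIEW [dbo].[vLogSuccessTables]
    AS
    SELECT  TableName,
            StartTime,
            EndTime,
            Duration,
            RecordCount
    FROM    [dbo].[ETL Log Tables]
    WHERE   Status = 'Success';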

Stored Procedures

  • spGetSourceConnection: Get the source connection details like server, database, username, and password.
  • spGetDestinationConnection: Get the destination connection details like server, database, username, and password.
  • spGetTableList: Get the list of tables to be loaded to the destination database. Table names are configured in the [dbo].[ETL Tables] table, along with a flag indicating whether each table should be loaded.
  • spLoadTable: Generate the load script (SELECT query) for a given source table based on the configurations specified in the [dbo].[ETL Columns] table. This script is used in the ADF copy activity.
  • spTruncateTable: Truncate the given table before loading data. This is used in the ADF copy activity's pre-copy script property in the sink section. Whether a table is truncated before loading varies based on business requirements; in this blog, we simply empty the table and reload the data (see the sketch after this list).
  • spLogTableLoad: Log the data loading status for each table in the [dbo].[ETL Log Tables] table. Logged details include Status (Success/Failure), TableName, StartTime, EndTime, Duration, RecordCount, etc.
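
As an example of how these procedures stay generic, here is a minimal sketch of spTruncateTable; the parameter names are assumptions for illustration, and the actual definition in the Git repository may differ.

    -- Sketch of spTruncateTable; called from the copy activity's pre-copy script.
    CREATE OR ALTER PROCEDURE [dbo].[spTruncateTable]
        @SchemaName SYSNAME,
        @TableName  SYSNAME
    AS
    BEGIN
        SET NOCOUNT ON;

        -- QUOTENAME guards the dynamic statement against unsafe metadata values.
        DECLARE @sql NVARCHAR(MAX) =
            N'TRUNCATE TABLE ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N';';

        EXEC sys.sp_executesql @sql;
    END;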

Data Landing Zones

While loading data from source to destination, it is necessary to clean, validate, and transform the data according to business rules. Users must have data landing zones like RAW, STAGE, CURATED, etc. These landing zones and transformations are part of the data life cycle, i.e., from data capture to data consumption.

  • RAW: Data is in the original format without any validations. It is read-only and accessible to service accounts only.
  • STAGE: This is the general staging after cleaning, validation, etc. It is accessible to data engineers, data analysts, and data scientists.
  • EDW/CURATED: Data is processed after applying different transformations and is ready to consume for dashboards. It is accessible to data engineers, data analysts, and data scientists.
  • SANDBOX: This zone is for experimentation, such as performing EDA (Exploratory Data Analysis) for AI/ML applications. It is accessible to data engineers and data scientists.
  • TEMP: Supports ingestion framework by creating and cleaning temporary data sets, etc. Data is short-lived here. It is accessible to service accounts and data engineers.
  • MASTER: It is reference data and archive data. It is accessible to data engineers, data analysts, and data scientists.

These landing zones can also be configured in the metadata repository, but that is not in the scope of this blog due to its complexity. We will cover it in the next blog post.

Azure Data Factory (ADF)

Once the metadata repository is set up, users can configure the pipelines. ADF provides many features to develop metadata-driven pipelines. Different activities are used to read the metadata configurations and make the pipelines act as the engine of the data platform.

For example, the Lookup activity fetches the details of the components to be loaded and processed, such as the list of tables and the columns of each table, which are already configured in the metadata repository tables. Activities like ForEach, Copy, and Data Flow transformations in Azure Data Factory are then used to load the data and perform further processing. The complete step-wise illustration document is available in this Git repository.

SHIR – Self-hosted Integration Runtime

In this blog post, we are using SQL Server on the local machine, so SHIR needs to be installed and configured on that machine. Refer to Microsoft's documentation to learn how to install and configure SHIR.

Linked Services

The below linked services and datasets are configured. They are utilized in various activities of the ADF pipelines.

  1. DPMD_LS_Local_SQLServer: It connects to the on-prem SQL Server metadata repository and provides the connection details for the source, destination, and staging areas. It is used by the parent pipeline, which passes the connection details on through parameters.
  2. DPMD_LS_SRC_Local_SQLServer: It connects to the on-prem source SQL Server and is populated through the dataset parameters.
  3. DPMD_LS_DEST_Local_SQLServer: It connects to the on-prem destination SQL Server. It is also populated through the dataset parameters.

Data sets

  1. DPMD_DS_Local_SQL: Connects to the on-prem SQL Server metadata repository. It passes the connection parameters to the DPMD_LS_Local_SQLServer linked service.
  2. DPMD_DS_SRC_Local_SQL: Connects to the on-prem source SQL Server. It passes the connection parameters to the DPMD_LS_SRC_Local_SQLServer linked service.
  3. DPMD_DS_DEST_Local_SQL: Connects to the on-prem destination SQL Server. It passes the connection parameters to the DPMD_LS_DEST_Local_SQLServer linked service.
  4. DPMD_DS_DEST_LocalSQL_Table: Connects to the on-prem destination SQL Server. It dynamically maps the source table to a destination table in the copy activity.

00-Get-Connections

This pipeline consists of parameters for connecting to the metadata repository and captures the connection details for the source and destination. Parameters can be passed from a scheduling trigger or configured manually during deployment.

Get-Source-Conn

This lookup activity calls the stored procedure [dbo].[spGetSourceConnection] and gets the source connection details like server, database, username, and password.

Get-Dest-Conn

This lookup activity calls the stored procedure [dbo].[spGetDestinationConnection] and gets the destination connection details like server, database, username, and password.
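
Both lookups return a single row of connection details. A minimal sketch of the source procedure is shown below, assuming the [dbo].[ETL Systems] layout sketched earlier; the destination procedure is identical except that it filters on 'Destination'.

    -- Sketch of spGetSourceConnection; the lookup activity reads the first returned row.
    CREATE OR ALTER PROCEDURE [dbo].[spGetSourceConnection]
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT TOP (1)
               ServerName,
               DatabaseName,
               UserName,
               [Password]
        FROM   [dbo].[ETL Systems]
        WHERE  SystemType = 'Source';
    END;

In a production setup, the password would typically be resolved from Azure Key Vault rather than stored in the repository.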

Load-Data

This execute pipeline activity executes the child pipeline that loads the data. It passes the source and destination connection details, captured in the lookup activities above, as parameters to that pipeline.

01-Load-Data

This pipeline takes the source and destination connection parameters from the parent pipeline and uses them in various activities. It also defines variables that are assigned values and used in the activities.

Get-Table-List

This lookup activity calls the stored procedure [dbo].[spGetTableList] and gets the list of tables to load based on the configurations in the metadata repository table [dbo].[ETL Tables]. Tables tagged with the value 1 in the [Enable] column are considered for loading.
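
A minimal sketch of spGetTableList, assuming the [dbo].[ETL Tables] columns sketched earlier:

    -- Sketch of spGetTableList; only tables flagged with [Enable] = 1 are returned.
    CREATE OR ALTER PROCEDURE [dbo].[spGetTableList]
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT  t.SchemaName,
                t.TableName
        FROM    [dbo].[ETL Tables] AS t
        WHERE   t.[Enable] = 1;
    END;

The for-each activity downstream iterates over this result set.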

Load-Each-Table

This for-each activity loads each table captured in the above lookup activity. It consists of the below activities to accomplish the task.

Get-Select-Query

This lookup activity calls the stored procedure [dbo].[spLoadTable] and gets the generated select query based on the configurations in the metadata repository table [dbo].[ETL Columns].
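
The heart of the framework is this generated query. A hedged sketch of spLoadTable is shown below; it assumes [dbo].[ETL Columns] carries SchemaName, TableName, ColumnName, and an [Enable] flag, and it requires SQL Server 2017 or later for STRING_AGG.

    -- Sketch of spLoadTable; builds the SELECT list from the enabled columns.
    CREATE OR ALTER PROCEDURE [dbo].[spLoadTable]
        @SchemaName SYSNAME,
        @TableName  SYSNAME
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @columns NVARCHAR(MAX);

        -- Collect only the columns configured for loading.
        SELECT @columns = STRING_AGG(CAST(QUOTENAME(ColumnName) AS NVARCHAR(MAX)), ', ')
        FROM   [dbo].[ETL Columns]
        WHERE  SchemaName = @SchemaName
          AND  TableName  = @TableName
          AND  [Enable]   = 1;

        -- The generated query is returned to the lookup activity and used as the copy source.
        SELECT N'SELECT ' + @columns + N' FROM '
               + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) AS SelectQuery;
    END;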

SelectQuery

This set variable activity assigns the generated select query to a variable. This query is then used in the copy activity source section.

Get-Start-time

This set variable activity assigns the start time of a table load to a variable. This value is then used in the logging part.

CopyTable

This copy activity executes the select query at the source database and loads the data to a destination database. We have used the same table name at the destination database to simplify the task.

Get-End-time

This set variable activity assigns the end time of a table load to a variable. This value is then used in the logging part in case of success.

Get-End-time_2

This set variable activity assigns the end time of a table load to a variable. This value is then used in the logging part in case of failure.

Log-Success

This execute pipeline activity runs the pipeline 11-Log-Table-Load and logs the success details to the [dbo].[ETL Log Tables] table. It passes the logging parameters (table name, status, start and end times, record count, etc.) to the pipeline.

Log-Failure

This execute pipeline activity runs the pipeline 11-Log-Table-Load and logs the failure details to the [dbo].[ETL Log Tables] table. It passes the same logging parameters to the pipeline, with a failure status.

11-Log-Table-Load

This pipeline consists of parameters for the logging details. It executes a stored procedure to log the details in case of success or failure.

spLogTableLoad

This activity calls the stored procedure [dbo].[spLogTableLoad] and logs the details based on the passed parameter values.
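
A minimal sketch of spLogTableLoad is shown below; the parameter and column names follow the logging details listed earlier for [dbo].[ETL Log Tables] and are otherwise assumptions.

    -- Sketch of spLogTableLoad; records one row per table load attempt.
    CREATE OR ALTER PROCEDURE [dbo].[spLogTableLoad]
        @TableName   SYSNAME,
        @Status      NVARCHAR(20),     -- 'Success' or 'Failure'
        @StartTime   DATETIME2,
        @EndTime     DATETIME2,
        @RecordCount BIGINT = NULL
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO [dbo].[ETL Log Tables]
            (TableName, Status, StartTime, EndTime, Duration, RecordCount)
        VALUES
            (@TableName, @Status, @StartTime, @EndTime,
             DATEDIFF(SECOND, @StartTime, @EndTime), @RecordCount);
    END;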

Executing the pipeline

Users can run the parent pipeline 00-Get-Connections by providing the metadata connection details like server, database, username, and password through pipeline parameters. It then captures the source and destination connection details and the table list, and loads the data.

Conclusion

In this blog, we have elaborated on a metadata-driven approach to developing a data platform for SQL Server-based data stores. Extending this to include tools like Data Lake for staging and Databricks notebooks for data processing is possible.

We can build data pipelines by replacing ADF or by combining ADF and notebooks. The advantage is that we can configure trained ML models and map the data sets to those models for next-level advanced analytics, utilizing Azure AI and ML services and libraries with a programming language like Python. This way, we can build a cloud-based Azure Modern Data Platform with descriptive, predictive, prescriptive, and diagnostic analytics.
