Modern data platforms are built to process data dynamically by creating metadata-driven data pipelines. In
addition to providing more scalability, a metadata-driven framework speeds up development and offers better
maintainability, visibility, and reusability. Such a framework enables users to process thousands of tables
and apply various processing steps without designing every data flow manually.
Metadata-driven ETL revolves around using metadata, which is data about data, to drive the ETL process.
Metadata provides information about the structure, format, and semantics of the data, enabling automation
and standardization of the ETL process. Instead of hard-coding every aspect of the data integration and
transformation process, metadata-driven ETL allows developers and data engineers to define the ETL logic,
rules, and transformations in a metadata repository. This repository is a centralized control system that
drives the ETL process based on the metadata definitions.
When the data processing pipeline is designed with a metadata-driven approach, data lineage won't appear out
of the box, since all processing logic sits in a metadata repository. This can be addressed by hooking up a
data lineage registration component like Microsoft Azure Purview.
In this blog, we will learn more about metadata-driven ETL on the Microsoft Azure cloud computing platform
and explore a metadata-driven approach to developing a data platform for SQL Server-based data stores.
Setting Up a Metadata Repository
While designing a metadata-driven processing framework, users can set up a database for metadata
configuration. A metadata-driven process differs from building transformations directly in an ETL tool:
instead, users store the specifications in an external database, which feeds that information to a
processing engine such as Azure Data Factory (ADF) or Spark.
We will use a SQL Server database as the metadata repository for this blog post. This initial framework
consists of a few tables, views, and stored procedures for defining the components used in the data platform.
The source and destination are SQL Server databases – AdventureWorksDW2019 is used as the source, and
AdventureWorksUAT as the destination database. The metadata repository also lives in the destination
database. Users can find the code for the tables, views, and stored procedures in this Git repository.
Tables
- ETL Systems: All data sources and destinations.
- ETL Data Period: Volume of live data and historical data based on a date.
- ETL Tables: Tables to be loaded from a source.
- ETL Columns: Columns of each table to be loaded.
- ETL Pipelines: Placeholder for the data pipelines, including name and description.
- ETL Log Tables: Used for storing logging results for the tables' load status.
- ETL Transformations: References to the source, destination, columns, etc.
- ETL Aggregations: Aggregations on numeric columns grouped by subject-oriented dimension
keys.
- ETL Log Pipelines: Used for storing logging results for pipeline execution status.
In a real-world, comprehensive framework, there are additional metadata tables for validations,
transformations, scheduling, triggering pipelines, security on data usage, etc. These are not in the scope
of this blog and will be covered in the next blog post.
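For illustration, a minimal sketch of two of these tables might look as follows. The column names here are assumptions made for this sketch; the actual definitions live in the Git repository linked above.

```sql
-- Minimal sketch of two metadata tables. Column names are illustrative
-- and may differ from the actual definitions in the Git repository.
CREATE TABLE [dbo].[ETL Tables]
(
    TableId    INT IDENTITY(1,1) PRIMARY KEY,
    SystemId   INT     NOT NULL,            -- reference to [dbo].[ETL Systems]
    SchemaName SYSNAME NOT NULL,
    TableName  SYSNAME NOT NULL,
    [Enable]   BIT     NOT NULL DEFAULT 1   -- 1 = include this table in the load
);

CREATE TABLE [dbo].[ETL Columns]
(
    ColumnId   INT IDENTITY(1,1) PRIMARY KEY,
    TableId    INT     NOT NULL,            -- reference to [dbo].[ETL Tables]
    ColumnName SYSNAME NOT NULL,
    [Enable]   BIT     NOT NULL DEFAULT 1   -- 1 = include this column in the SELECT
);
```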
Views
- vLogSuccessTables: Display table load success details from the [dbo].[ETL Log Tables]
table.
- vLogFailureTables: Display table load failure details from the [dbo].[ETL Log Tables]
table.
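As an example, a simplified sketch of the success view could look like the one below, assuming the [dbo].[ETL Log Tables] table carries the columns listed later under spLogTableLoad; the actual view may project additional columns.

```sql
-- Simplified sketch of vLogSuccessTables; vLogFailureTables would filter
-- on 'Failure' instead. Column names follow the logging details described
-- in this post and may differ from the actual repository code.
CREATE VIEW [dbo].[vLogSuccessTables]
AS
SELECT TableName,
       StartTime,
       EndTime,
       Duration,
       RecordCount
FROM   [dbo].[ETL Log Tables]
WHERE  [Status] = 'Success';
```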
Stored Procedures
- spGetSourceConnection: Get the source connection details like server, database,
username, and password.
- spGetDestinationConnection: Get the destination connection details like server,
database, username, and password.
- spGetTableList: Get the list of tables to be loaded to the destination database. Each table is flagged in
the [dbo].[ETL Tables] table to indicate whether it should be loaded.
- spLoadTable: Generate the load script (SELECT query) for a given source table based on
the configurations specified in [dbo].[ETL Columns] table. This script is used in the ADF copy activity.
- spTruncateTable: Truncate the given table before loading data. This is used in the ADF copy activity's
pre-copy script property in the sink section. Whether a table is truncated before loading varies with
business requirements; in this blog, we simply empty the table and reload the data.
- spLogTableLoad: Log the data loading status for each table in the [dbo].[ETL Log Tables] table. Logged
details include Status (Success/Failure), TableName, StartTime, EndTime, Duration, RecordCount, etc.
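To make the column-driven load script concrete, here is a hedged sketch of what spLoadTable could look like. The parameter names, joins, and the use of STRING_AGG are assumptions for this sketch; the actual procedure in the Git repository may differ.

```sql
-- Illustrative sketch of [dbo].[spLoadTable]: builds a SELECT statement
-- from the enabled columns of a table. Not the actual repository code.
CREATE PROCEDURE [dbo].[spLoadTable]
    @SchemaName SYSNAME,
    @TableName  SYSNAME
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @ColumnList NVARCHAR(MAX);

    -- Comma-separated list of the enabled columns for this table.
    SELECT @ColumnList = STRING_AGG(QUOTENAME(c.ColumnName), ', ')
    FROM   [dbo].[ETL Columns] AS c
    JOIN   [dbo].[ETL Tables]  AS t ON t.TableId = c.TableId
    WHERE  t.SchemaName = @SchemaName
      AND  t.TableName  = @TableName
      AND  c.[Enable]   = 1;

    -- Return the generated SELECT query; the ADF copy activity uses it as its source query.
    SELECT N'SELECT ' + @ColumnList
         + N' FROM '  + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
           AS LoadScript;
END;
```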
Data Landing Zones
While loading data from source to destination, it is necessary to clean, validate, and transform the data
according to business rules. Users must have data landing zones like RAW, STAGE, CURATED, etc. These landing
zones and transformations are part of the data life cycle, i.e., from data capture to data consumption.
- RAW: Data is in the original format without any validations. It is read-only and
accessible to service accounts only.
- STAGE: This is the general staging after cleaning, validation, etc. It is accessible to
data engineers, data analysts, and data scientists.
- EDW/CURATED: Data is processed after applying different transformations and is ready to
consume for dashboards. It is accessible to data engineers, data analysts, and data scientists.
- SANDBOX: This is for laboratory purposes to perform EDA (Exploratory Data Analysis) for
AI-ML applications. It is accessible to data engineers and data scientists.
- TEMP: Supports ingestion framework by creating and cleaning temporary data sets, etc.
Data is short-lived here. It is accessible to service accounts and data engineers.
- MASTER: Holds reference data and archive data. It is accessible to data engineers, data analysts, and
data scientists.
These landing zones can also be configured in the metadata repository, but that is not in the scope of this
blog due to its complexity. We will cover it in the next blog post.
Azure Data Factory (ADF)
Once the metadata repository is set up, users can configure the pipelines. ADF provides many features to
develop metadata-driven pipelines. Different activities are used to read the metadata configurations and
make the pipelines act as the engines of the data platform.
For example, the Lookup activity fetches the details of components to be loaded and processed.
Components like the list of tables and columns from each table are already configured in the metadata
repository tables. Activities like ForEach, Copy, and Data Flow transformations in Azure Data Factory are
then used to load the data and perform further processing. To access the complete step-by-step illustration
document, use this Git repository.
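For instance, the Lookup activities in this framework do little more than call the stored procedures defined in the metadata repository; the queries they issue are essentially the following.

```sql
-- Queries issued by the Lookup activities against the metadata repository.
EXEC [dbo].[spGetSourceConnection];       -- source server, database, username, password
EXEC [dbo].[spGetDestinationConnection];  -- destination server, database, username, password
EXEC [dbo].[spGetTableList];              -- list of tables flagged for loading
```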
SHIR – Self-hosted Integration Runtime
In this blog post, we are using SQL Server on the local machine, so SHIR needs to be installed and
configured on that machine. Please follow the link below to learn how to install and configure SHIR.
Linked Services
The following linked services and datasets are configured. They are utilized in various activities of the
ADF pipelines.
- DPMD_LS_Local_SQLServer: It connects to the on-prem SQL Server metadata repository and retrieves the
connection details of the source, destination, and staging areas. It is used by the parent pipeline, which
passes the connection details on through parameters.
- DPMD_LS_SRC_Local_SQLServer: It connects to the on-prem source SQL Server and is
populated through the dataset parameters.
- DPMD_LS_DEST_Local_SQLServer: It connects to the on-prem destination SQL Server. It
is also populated through the dataset parameters.
Data sets
- DPMD_DS_Local_SQL: Connects to the on-prem SQL Server metadata repository. It passes the connection
parameters to the DPMD_LS_Local_SQLServer linked service.
- DPMD_DS_SRC_Local_SQL: Connects to the on-prem source SQL Server. It passes the connection parameters to
the DPMD_LS_SRC_Local_SQLServer linked service.
- DPMD_DS_DEST_Local_SQL: Connects to the on-prem destination SQL Server. It passes the connection
parameters to the DPMD_LS_DEST_Local_SQLServer linked service.
- DPMD_DS_DEST_LocalSQL_Table: Connects to the on-prem destination SQL Server. It dynamically maps the
source table to the destination table in the copy activity.
00-Get-Connections
This pipeline takes parameters for connecting to the metadata repository and captures the connection details
for the source and destination. The parameters can be passed from a scheduling trigger or configured
manually during deployment.
Get-Source-Conn
This lookup activity calls the stored procedure [dbo].[spGetSourceConnection] and gets the source
connection details like server, database, username, and password.
Get-Dest-Conn
This lookup activity calls the stored procedure [dbo].[spGetDestinationConnection] and gets the
destination connection details like server, database, username, and password.
Load-Data
This execute pipeline activity runs the child pipeline that loads the data. It passes the source and
destination connection details, captured in the lookup activities above, to that pipeline as parameters.
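Assuming the connection details are stored in the [dbo].[ETL Systems] table, the connection-lookup procedures called by the activities above could be as simple as the sketch below. The column and filter names here are assumptions, and the destination variant would filter on a different system type.

```sql
-- Hypothetical sketch of [dbo].[spGetSourceConnection]; the actual procedure
-- and the [ETL Systems] column names may differ.
CREATE PROCEDURE [dbo].[spGetSourceConnection]
AS
BEGIN
    SET NOCOUNT ON;

    SELECT TOP (1)
           ServerName,
           DatabaseName,
           UserName,
           [Password]
    FROM   [dbo].[ETL Systems]
    WHERE  SystemType = 'Source';
END;
```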
01-Load-Data
This pipeline takes the parameters from the parent pipeline for source and destination connections and
uses them in various activities. It also defines variables that are assigned values and used in the
activities.
Get-Table-List
This lookup activity calls the stored procedure [dbo].[spGetTableList] and gets the list of tables to load
based on the configurations in the metadata repository table [dbo].[ETL Tables]. Tables with the [Enable]
column set to 1 are considered for loading.
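Under the [ETL Tables] schema sketched earlier, spGetTableList amounts to little more than the following; the actual procedure may return additional columns.

```sql
-- Minimal sketch of [dbo].[spGetTableList], based on the assumed
-- [ETL Tables] schema; not the actual repository code.
CREATE PROCEDURE [dbo].[spGetTableList]
AS
BEGIN
    SET NOCOUNT ON;

    SELECT SchemaName,
           TableName
    FROM   [dbo].[ETL Tables]
    WHERE  [Enable] = 1;   -- only tables flagged for loading
END;
```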
Load-Each-Table
This ForEach activity loads each table captured in the above lookup activity. It consists of the below
activities to accomplish the task.
Get-Select-Query
This lookup activity calls the stored procedure [dbo].[spLoadTable] and gets the generated SELECT query
based on the configurations in the metadata repository table [dbo].[ETL Columns].
SelectQuery
This set variable activity assigns the generated SELECT query to a variable. This query is then used in the
copy activity's source section.
Get-Start-time
This set variable activity assigns the start time of a table load to a variable. This value is then used
in the logging part.
CopyTable
This copy activity executes the SELECT query against the source database and loads the data into the
destination database. We use the same table name at the destination database to simplify the task.
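The copy activity's sink also runs the pre-copy script mentioned earlier, which calls [dbo].[spTruncateTable]. A hedged sketch of that procedure, assuming it receives the schema and table name as parameters:

```sql
-- Illustrative sketch of [dbo].[spTruncateTable]; the real procedure may
-- validate the table name against the metadata tables before truncating.
CREATE PROCEDURE [dbo].[spTruncateTable]
    @SchemaName SYSNAME,
    @TableName  SYSNAME
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @Sql NVARCHAR(MAX) =
        N'TRUNCATE TABLE ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N';';

    EXEC sys.sp_executesql @Sql;
END;
```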
Get-End-time
This set variable activity assigns the end time of a table load to a variable. This value is then used
in the logging part in case of success.
Get-End-time_2
This set variable activity assigns the end time of a table load to a variable. This value is then used
in the logging part in case of failure.
Log-Success
This execute pipeline activity runs the pipeline 11-Log-Table-Load and logs the success details to the
[dbo].[ETL Log Tables] table. It passes the logging parameters (table name, status, start time, end time,
etc.) to the pipeline.
Log-Failure
This execute pipeline activity runs the pipeline 11-Log-Table-Load and logs the failure details to the
[dbo].[ETL Log Tables] table. It passes the logging parameters (table name, status, start time, end time,
etc.) to the pipeline.
11-Log-Table-Load
This pipeline consists of parameters for logging details. It executes a stored procedure to log the
details in case of success or failure.
spLogTableLoad
This activity calls the stored procedure [dbo].[spLogTableLoad] and logs the details based on the passed
parameter values.
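Assuming the log columns described earlier, the logging procedure could be sketched as below; the parameter names and the duration calculation are assumptions for this sketch.

```sql
-- Illustrative sketch of [dbo].[spLogTableLoad]; parameters mirror the
-- logged details described in this post and may differ from the actual code.
CREATE PROCEDURE [dbo].[spLogTableLoad]
    @TableName   SYSNAME,
    @Status      VARCHAR(10),    -- 'Success' or 'Failure'
    @StartTime   DATETIME2(3),
    @EndTime     DATETIME2(3),
    @RecordCount BIGINT = NULL
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO [dbo].[ETL Log Tables]
        (TableName, [Status], StartTime, EndTime, Duration, RecordCount)
    VALUES
        (@TableName, @Status, @StartTime, @EndTime,
         DATEDIFF(SECOND, @StartTime, @EndTime),   -- duration in seconds
         @RecordCount);
END;
```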
Executing the pipeline
Users can run the parent pipeline 00-Get-Connections by providing the metadata repository connection details
like server, database, username, and password through pipeline parameters. The pipeline then captures the
source and destination connection details, fetches the table list, and loads the data.
Conclusion
In this blog, we have elaborated on a metadata-driven approach to developing a data platform for SQL
Server-based data stores. This can be extended to include tools like Data Lake for staging and Databricks
notebooks for data processing.
We can build data pipelines by replacing ADF or by combining ADF and notebooks. The advantage is that we can
configure trained ML models and map the data sets to those models for next-level advanced analytics,
utilizing Azure AI and ML services and libraries with a programming language like Python. This way, we can
build a cloud-based Azure Modern Data Platform with descriptive, predictive, prescriptive, and diagnostic
analytics.