There are three main data warehouse models: Kimball, Inmon, and Data Vault. Each model has its own strengths and weaknesses, so the choice between them depends largely on business requirements.
A Data Warehouse is defined as a central repository for relevant business data. It is a type of data management system that is meant to support business intelligence (BI) by performing queries and analyses on large amounts of data such as application logs, transactional data, user data, and more.
As such, designers of data warehouses must weigh performance against business requirements. For instance, a business that ingests large streams of data at any given time needs fast writes, while a business that queries more than it inserts may prefer a model that enables faster reads.
Data warehouses have four key characteristics, namely: subject-oriented, integrated, time-variant, and non-volatile. These characteristics are the hallmarks of an efficient and business intelligence-ready data warehouse.
First, a data warehouse is subject-oriented, meaning it hosts data relevant to a particular theme such as sales, distribution, or marketing. Second, it is integrated, meaning data from several sources can be pooled together. Third, it is time-variant: data is partitioned according to time periods. Finally, it is non-volatile: data can only be inserted, not replaced or deleted.
Source: Geeksforgeeks.com
The Inmon approach involves an enterprise data warehouse that logically separates (normalizes) data into data marts. Data marts represent one particular area of focus and usually correspond to its own reporting tool. Examples of particular areas of focus include product details, business drivers, aspects, relationships, dependencies, and affiliations.
The enterprise data warehouse acts as the single source of truth to provide data consistency. The advantage of this model is that writing data is easy due to the normalized nature of the model. However, querying data is more difficult for the very same reason.
Source: Geeksforgeeks.com
The Kimball approach is also often referred to as the star schema. Here, the dimensional data warehouse is logically segmented into multiple interconnected data marts, built from the bottom up. A data mart in the Kimball model represents one subject area with several attributes, and common attributes connect data marts to each other.
Compared to the Inmon approach, the denormalized, bottom-up nature of the Kimball model makes data easy to query. However, the tradeoff lies in the fact that there is no single source of truth, meaning data inconsistencies may arise.
Source: Researchgate
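To make the star-schema read path concrete, here is a minimal sketch using SQLite; all table and column names are hypothetical, chosen only to illustrate how a single fact table joins to denormalized dimension tables.

```python
import sqlite3

# Illustrative star schema: one fact table joined to denormalized dimensions.
# Table and column names are invented for demonstration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL
);
""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(20240101, 2024, 1), (20240201, 2024, 2)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 20240101, 100.0), (2, 20240101, 50.0), (1, 20240201, 75.0)])

# A typical read: a short join path from the fact table to each dimension.
cur.execute("""
SELECT p.category, d.year, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key = f.date_key
GROUP BY p.category, d.year
""")
rows = cur.fetchall()
print(rows)  # [('Hardware', 2024, 225.0)]
```

The query never needs to traverse more than one join per dimension, which is exactly why the denormalized Kimball layout reads so quickly.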
The Data Vault Model (DVM) was created by Dan Linstedt with the tagline “Turning Big Data into Sensible Business Information”. It is more than a model. It is also an architectural approach designed for agile warehousing. It is meant to deliver an enterprise data warehouse (similar to the Inmon approach) whilst removing the drawbacks of the 3rd Normal Form (3NF) and dimensional modeling techniques.
Dan Linstedt defines the Data Vault as “a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business”. He defines its main features as the following:
The staging area: this layer collates data from multiple sources.
The data warehouse: this is considered the "core" layer. It contains untransformed data in entities such as hubs, links, and satellites, while the Business Data Warehouse contains the data resulting from the application of business logic.
The data marts: this layer is also known as the "presentation" layer. Data from the business data warehouse is exploited for reporting and mining by OLAP cubes. Additionally, this layer is deployed via dimensional modeling.
Entities
Source: DWGeek.com
As mentioned earlier, there are three entities in a data vault: the hub, the link, and the satellite.
The hub contains a list of unique business keys, each with its own surrogate key. These business keys represent core business concepts such as customer, vendor, sale, or product, and they are used to track and locate information within the business data warehouse.
The link establishes connections between hubs (business keys) and other links. Note that links do not hold descriptive data; they simply represent the relationships between the data contained in hubs.
The satellite holds temporal attributes that can change over time and simultaneously acts as a historical record. It is a time-dimensional table containing descriptive information that provides context to a hub's or link's business keys.
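As a rough illustration of how these three entities fit together, the sketch below creates a hub, link, and satellite in SQLite. The table layouts are simplified and the names are invented for demonstration, but they show the insert-only, history-preserving pattern described above.

```python
import hashlib
import sqlite3

def hk(*parts):
    # Hash key over business-key parts (illustrative; the hash choice varies).
    return hashlib.md5("|".join(parts).encode()).hexdigest()

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Hub: unique business keys only, no descriptive attributes.
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY, customer_bk TEXT UNIQUE,
    load_dts TEXT, record_source TEXT
);
-- Link: relationships between hubs; holds no descriptive data itself.
CREATE TABLE link_customer_order (
    link_hk TEXT PRIMARY KEY, customer_hk TEXT, order_hk TEXT,
    load_dts TEXT, record_source TEXT
);
-- Satellite: time-variant descriptive attributes keyed by hub + load date.
CREATE TABLE sat_customer (
    customer_hk TEXT, load_dts TEXT, name TEXT, city TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);
""")

cur.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
            (hk("C-1001"), "C-1001", "2024-01-01", "crm"))
# Two satellite rows preserve history: the city change is inserted, not updated.
cur.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
            (hk("C-1001"), "2024-01-01", "Alice", "Austin"))
cur.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
            (hk("C-1001"), "2024-06-01", "Alice", "Boston"))

cur.execute("SELECT COUNT(*) FROM sat_customer")
versions = cur.fetchone()[0]
print(versions)  # 2 historical versions, no destructive update
```

Because the satellite is append-only, every prior state of the customer remains queryable by load date, which is what gives the data vault its historical-tracking and auditing properties.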
Though all data models deliver relative to what is needed, business requirements have evolved considerably over the past decade. Data Vault aims to address these modern requirements. Here are the key benefits of using the Data Vault:
It goes beyond hubs, links, and satellites. It addresses people, process, technology concerns, architectural structure, and continuous improvement. In short, it is a methodology that encompasses a model.
There is a community surrounding Data Vault that strives for continuous improvement. Each year, thought leaders attend the worldwide Data Vault 2.0 conference to present new and innovative ways to adapt to emerging trends.
It does not take monumental effort to get a small production system up and running. There is visible value in very little time--a hallmark of agility. Additionally, its agile nature allows Data Vault to work with small to multi-petabyte datasets.
Meaningful insights can be derived when multiple data types from different sources are analyzed in conjunction with each other. Data Vault's integrated analytics makes it convenient to analyze multiple datasets together.
Data Vault is able to keep historical records of data. This means that tracing data over time is possible, providing auditability and accountability. This feature effectively allows the Data Vault to act as a source of truth whilst allowing constant data changes.
Data Vault is not particular in terms of data. It can support any kind of data whether it be structured, semi-structured, or unstructured.
Automation reduces operational expenses. Data Vault allows ETL code to be reduced to a set of repeatable SQL queries, thereby reducing development expenses.
The data stored in the data vault is viewable in real-time data feeds for stakeholder perusal.
Data Vault allows for experimentation because any unwanted data/components can be easily rolled back to the last working version. Developers and designers need not worry about effecting lasting changes.
Data Vault 2.0 is the newest version of DVM that implements modern best practices. Here are the improvements that DVM 2.0 has made over DVM 1.0:
Three kinds of primary keys can be used: sequence numbers, hash keys, or business keys. Sequence numbers are not ideal because of their upper limit, their lack of business meaning, and load dependencies that negatively impact performance.
Business keys, on the other hand, map directly to business concepts (unlike sequence numbers). Additionally, they can be considered master keys because they hold the same value for the entire lifespan of the dataset, regardless of data transfers. They also allow for parallel loading.
A surrogate key is an object identifier used in lieu of a direct business key. In Data Vault 2.0, hash keys are used as surrogate keys. This paves the way for parallel data loading because hash keys eliminate dependencies between entities. Compared to traditional keys, they enable lazy joins, are deterministic, can represent unstructured and multi-structured datasets, and fit in a single-field primary key attribute.
Hash keys are important because they allow connections between heterogeneous data environments such as Hadoop and Oracle. They also allow big data system loads to scale properly because of the lack of dependencies between data sources.
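A minimal sketch of the idea, assuming MD5 over a normalized business key (actual implementations may choose a different hash function or normalization rules):

```python
import hashlib

def hash_key(*business_key_parts, sep="|"):
    """Deterministic surrogate key from a business key (illustrative DV 2.0 style)."""
    # Normalize so the same logical key always hashes identically.
    normalized = sep.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always yields the same hash key, on any platform,
# so a hub load in Oracle and a satellite load in Hadoop can run in parallel
# without waiting on a central sequence generator to hand out IDs.
k1 = hash_key("C-1001")
k2 = hash_key(" c-1001 ")  # normalization makes the result deterministic
assert k1 == k2
print(k1)
```

Because the key is computed rather than looked up, each loading process can derive it independently, which is precisely what removes the inter-entity load dependencies described above.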
Past usage of links included representing relations, events, or transactions, which meant link-to-link constructs were common. In DV 2.0, the updated best practice is to use links only for business relations. The community believes that treating a link as an event or transaction deviates from the core pattern.
The modeling of reference tables in DV 2.0 expands beyond hubs and satellites. In fact, 2NF tables may be used as reference tables depending on business needs. In DV 2.0, we see wider flexibility in how reference tables are modeled.
Many data vault automation tools exist today. However, the common denominator across all these tools is that they are metadata-driven to ensure consistency and flexibility across the entire lifecycle of a data warehouse.
As a result, data vault automation is largely template-driven. Metadata will dictate the generated data vault-specific objects, tables, and load processes. This approach lessens the development time needed for ETL scripts. The following processes are included in automation:
Entities such as the hub, link, and satellite are standardized and can be loaded with templated SQL.
Data lineage pertains to the traceable history of transformations enacted upon data over time. It describes a dataset’s origin, movement, characteristics, and quality. Data Vault automation tools include automatic data lineage analysis in order to derive metadata and result in a templated script that will generate elements of the data vault.
Data governance is a collection of processes, policies, and standards that makes sure that any organization that utilizes big data will be able to use said data to achieve its goals. Normally, organizations have their own data governance strategies in place that standardize the usage of data. However, with Data Vault automation tools, data governance strategy guidelines can be automatically derived from metadata.
DDL (data definition language), DML (data manipulation language), and ETL (extract, transform, load) scripts are integral to data management in a data vault. DDL statements define database structures, while DML statements create, modify, or remove data.
ETL defines how data is fetched from the data lake, transformed, and loaded into a data warehouse.
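As a rough sketch of the metadata-driven, template-driven automation described above, the snippet below renders a hub-load statement from a hypothetical metadata record. Real automation tools derive this metadata automatically (for example, from data lineage analysis) and support many more load patterns.

```python
from string import Template

# Hypothetical metadata record describing one hub; in practice, automation
# tools derive records like this from source-system scans and lineage analysis.
hub_meta = {
    "hub": "hub_customer",
    "business_key": "customer_bk",
    "hash_key": "customer_hk",
    "staging_table": "stg_crm_customer",
    "record_source": "crm",
}

# One template serves every hub in the warehouse: only the metadata changes.
HUB_LOAD = Template("""\
INSERT INTO $hub ($hash_key, $business_key, load_dts, record_source)
SELECT s.$hash_key, s.$business_key, CURRENT_TIMESTAMP, '$record_source'
FROM $staging_table s
LEFT JOIN $hub h ON h.$hash_key = s.$hash_key
WHERE h.$hash_key IS NULL;""")

sql = HUB_LOAD.substitute(hub_meta)
print(sql)
```

Generating hundreds of such load scripts from one vetted template is what lets automation tools collapse ETL development down to maintaining metadata.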
A healthcare organization that connects health and life companies with members of the healthcare ecosystem through targeted marketing campaigns opted to use Data Vault 2.0 to replace its legacy data silos.
Over the years, the company had collected enormous amounts of data in order to create personalized campaigns for its clients, and every acquisition meant additional data silos. It was challenging to integrate 12-year-old legacy data with fresh data to produce analytics and relevant business intelligence.
Additionally, their data warehouse was built on-premises with no documentation. ETL became far too cumbersome for the organization and as such hampered the efficiency of business intelligence processes which in turn negatively impacted the business as a whole.
The solution was to automate the data vault lifecycle to accelerate time-to-value. The organization decided to migrate its data from an on-premises Microsoft SQL Server environment to a cloud data warehouse.
From there, the organization adopted the Data Vault 2.0 modeling method. In addition, they were able to automate the process from development to deployment. In a short amount of time, the organization was able to automatically generate entities, analyze data lineage, generate appropriate data governance policies, and generate DDL, DML, and ETL scripts.
Here at Codvo.ai, we want to make sure that business intelligence is meaningful. We believe that adopting the Data Vault 2.0 methodology is the agile way to model your data warehouse. Hence, we are experts in the Data Vault 2.0 modeling strategy where we make sure to stay up to date with industry best practices in order to provide efficient and streamlined business intelligence processes.
If you are interested, contact us today at marketing@codvo.ai