
Data Vault - The Complete Guide (2022 Edition)

What Are The Three Data Warehouse Models?

There are three main data warehouse models: Kimball, Inmon, and Data Vault. Each model has its own strengths and weaknesses, so the choice between them depends largely on business requirements.

Data Warehouse Basics

A Data Warehouse is defined as a central repository for relevant business data. It is a type of data management system that is meant to support business intelligence (BI) by performing queries and analyses on large amounts of data such as application logs, transactional data, user data, and more.

As such, designers of data warehouses must weigh performance against business requirements. For instance, a business that ingests large streams of data at any given time needs fast writes, whereas a business that queries more than it inserts may be more interested in a model that enables faster reads.

Characteristics of a Data Warehouse

Data warehouses have four key characteristics: they are subject-oriented, integrated, time-variant, and non-volatile. These characteristics are the hallmarks of an efficient, business intelligence-ready data warehouse.

A data warehouse is subject-oriented, meaning it hosts data relevant to a particular theme such as sales, distribution, or marketing. Second, a data warehouse is integrated, meaning data from several sources can be pooled together. Third, it is time-variant: data is partitioned according to time periods. Finally, it is non-volatile: data can only be inserted--not replaced or deleted.

Inmon

[Diagram of the Inmon model. Source: GeeksforGeeks.com]

The Inmon approach involves an enterprise data warehouse that holds normalized data and feeds downstream data marts. Each data mart represents one particular area of focus and usually has its own reporting tool. Examples of areas of focus include product details, business drivers, aspects, relationships, dependencies, and affiliations.

The enterprise data warehouse acts as the single source of truth to provide data consistency. The advantage of this model is that writing data is easy due to the normalized nature of the model. However, querying data is more difficult for the very same reason.
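As a brief illustration, here is a minimal sketch of a few normalized (3NF) tables in an Inmon-style enterprise data warehouse. All table and column names are assumptions made for the example, not part of the Inmon method itself.

```sql
-- Illustrative 3NF tables for an Inmon-style enterprise data warehouse.
-- Each fact is stored exactly once; orders reference customers and products by key.
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(200) NOT NULL,
    country_code  CHAR(2)      NOT NULL
);

CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    list_price   NUMERIC(12, 2)
);

CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    order_date  DATE    NOT NULL
);

CREATE TABLE sales_order_line (
    order_id   INTEGER NOT NULL REFERENCES sales_order (order_id),
    line_no    INTEGER NOT NULL,
    product_id INTEGER NOT NULL REFERENCES product (product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_no)
);
```

Writing into such a model is straightforward because each value lives in one place, but answering a reporting question typically means joining several of these tables, which is why querying is considered harder in this approach.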

Kimball

[Diagram of the Kimball model. Source: GeeksforGeeks.com]

The Kimball approach is often referred to as the star schema approach. A dimensional data warehouse is built from multiple interconnected data marts. A data mart in the Kimball model represents one subject area with several attributes, and common attributes connect the data marts to each other.

Compared to the Inmon approach, the denormalized, bottom-up nature of the Kimball model makes data easy to query. The tradeoff is that there is no single source of truth--meaning there may be data irregularities.
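For comparison, here is a minimal sketch of a Kimball-style star schema for the same sales subject area; again, the table and column names are illustrative assumptions.

```sql
-- Illustrative star schema for a sales data mart (Kimball-style).
-- A central fact table references denormalized dimension tables by surrogate key.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,   -- surrogate key
    customer_name VARCHAR(200),
    city          VARCHAR(100),
    country       VARCHAR(100)           -- denormalized: no separate geography table
);

CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,      -- e.g. 20220131
    full_date  DATE,
    month_name VARCHAR(20),
    year_no    INTEGER
);

CREATE TABLE fact_sales (
    customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
    quantity     INTEGER NOT NULL,
    sales_amount NUMERIC(12, 2) NOT NULL
);

-- A typical report is a short join from the fact table to its dimensions.
SELECT d.year_no, d.month_name, SUM(f.sales_amount) AS revenue
FROM   fact_sales f
JOIN   dim_date   d ON d.date_key = f.date_key
GROUP  BY d.year_no, d.month_name;
```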

Data Vault

[Diagram of the Data Vault model. Source: ResearchGate]

The Data Vault Model (DVM) was created by Dan Linstedt with the tagline “Turning Big Data into Sensible Business Information”. It is more than a model. It is also an architectural approach designed for agile warehousing. It is meant to deliver an enterprise data warehouse (similar to the Inmon approach) whilst removing the drawbacks of the 3rd Normal Form (3NF) and dimensional modeling techniques.

Features

Dan Linstedt defines the Data Vault as “a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business”. He defines its main features as the following:

  1. Structural information is separated from descriptive information for reasons of flexibility and avoidance of re-engineering
  2. Allows parallel loading of data
  3. Data is unprocessed and unfiltered
  4. Data is intact and never changes
  5. The Data Vault structure itself is not where data is finally exploited; reporting and analysis happen downstream in data marts

Tiers

1. Preparation Area

This layer collates data from multiple sources.

2. “Data Vault” and Business-Oriented Data Warehouse

This is considered the “core” layer. The Data Vault contains untransformed data in entities such as hubs, links, and satellites, while the Business-Oriented Data Warehouse contains the data that results from applying business logic.

3. Data Marts

This layer is also known as the “presentation” layer. Data from the business data warehouse is exploited for reporting and mining via OLAP cubes. This layer is typically deployed using dimensional modeling.

Entities

[Diagram of hub, link, and satellite entities. Source: DWGeek.com]

As mentioned earlier, there are three entities in a data vault: the hub, the link, and the satellite.

1. Hub

A hub contains a list of unique business keys, each with its own surrogate key. These business keys represent core business concepts such as customer, vendor, sale, or product. They are used to track and locate information within the business data warehouse.

2. Link

A link establishes connections between hubs (business keys) and other links. Note that links do not hold descriptive data--they simply represent the relationships between the business keys contained in hubs.

3. Satellite

A satellite holds temporal attributes that can change over time and simultaneously acts as a historical record. It is a time-dimensional table containing descriptive information that provides context to a hub’s or link’s business keys.
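To make the three entities concrete, here is a minimal SQL sketch of a hub, a link, and a satellite for a hypothetical “customer places order” subject area. The hub_/link_/sat_ prefixes, hash-key columns, load_date, and record_source follow common Data Vault 2.0 conventions, but the specific tables and columns are assumptions for this example.

```sql
-- Illustrative Data Vault entities (assumed names, common DV 2.0 conventions).

-- Hubs: unique business keys plus a surrogate (hash) key and load metadata.
CREATE TABLE hub_customer (
    customer_hk   CHAR(32)     PRIMARY KEY,   -- hash of the business key
    customer_bk   VARCHAR(50)  NOT NULL,      -- business key from the source system
    load_date     TIMESTAMP    NOT NULL,
    record_source VARCHAR(100) NOT NULL
);

CREATE TABLE hub_order (
    order_hk      CHAR(32)     PRIMARY KEY,
    order_bk      VARCHAR(50)  NOT NULL,
    load_date     TIMESTAMP    NOT NULL,
    record_source VARCHAR(100) NOT NULL
);

-- Link: the relationship between business keys; no descriptive attributes.
CREATE TABLE link_customer_order (
    customer_order_hk CHAR(32)     PRIMARY KEY,  -- hash of both business keys
    customer_hk       CHAR(32)     NOT NULL REFERENCES hub_customer (customer_hk),
    order_hk          CHAR(32)     NOT NULL REFERENCES hub_order (order_hk),
    load_date         TIMESTAMP    NOT NULL,
    record_source     VARCHAR(100) NOT NULL
);

-- Satellite: descriptive attributes over time, keyed by parent hash key + load date.
CREATE TABLE sat_customer_details (
    customer_hk   CHAR(32)     NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TIMESTAMP    NOT NULL,
    customer_name VARCHAR(200),
    email         VARCHAR(200),
    hash_diff     CHAR(32)     NOT NULL,        -- hash of the attributes, for change detection
    record_source VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
```

Because structural information (hubs and links) is separated from descriptive information (satellites), new attributes or sources can be accommodated by adding satellites or links rather than restructuring existing tables.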

Why Data Vault?

Though all data models deliver relative to what is needed, today’s needs have evolved over the past decade. Common business requirements now include real-time data feeds, the integration of structured and unstructured data from many sources, auditability, and the ability to scale from small to very large datasets.

Data Vault aims to address all of these needs in order to cater to modern business requirements. Here are the key benefits of using the Data Vault:

1. It is a Business System--Not Just a Data Warehouse Model

It goes beyond hubs, links, and satellites. It addresses people, process, technology concerns, architectural structure, and continuous improvement. In short, it is a methodology that encompasses a model.

2. Continuously Evolving

There is a community surrounding Data Vault that strives for continuous improvement. Each year, thought leaders attend the worldwide Data Vault 2.0 conference to present new and innovative ways to adapt to emerging trends.

3. Agile

It does not take monumental effort to get a small production system up and running. There is visible value in very little time--a hallmark of agility. Additionally, its agile nature allows Data Vault to work with small to multi-petabyte datasets.

4. Integrated Analytics

Meaningful insights can be derived when multiple data types from different sources are analyzed in conjunction with each other. Data Vault’s integrated approach makes it convenient to analyze multiple datasets together.

5. Auditable

Data Vault keeps historical records of data. This means that tracing data over time is possible, which provides auditability and accountability. This feature effectively allows the Data Vault to act as a source of truth whilst allowing constant data changes.

6. Supports Structured, Semi-Structured, and Unstructured Data

Data Vault is not particular about the shape of data. It can support any kind of data, whether structured, semi-structured, or unstructured.

7. Supports Automation

Automation reduces operational expenses. Data Vault allows ETL code to be reduced to a set of repeatable SQL queries, thereby reducing development expenses.

8. Works with Real-Time Data Feeds

The data stored in the data vault is viewable in real-time data feeds for stakeholder perusal.

9. Experimentation and Flexibility

Data Vault allows for experimentation because any unwanted data or components can easily be rolled back to the last working version. Developers and designers need not worry about making irreversible changes.

Data Vault 1.0 Versus Data Vault 2.0

Data Vault 2.0 is the newest version of DVM that implements modern best practices. Here are the improvements that DVM 2.0 has made over DVM 1.0:

1. Hashkeys

The following can be used as primary keys: sequence numbers, hash keys, or business keys. Sequence numbers are not ideal because of their upper limit, lack of business meaning, and load issues that negatively impact performance.

Business keys, on the other hand, map to business concepts (unlike sequence numbers). Additionally, they can be considered master keys because they keep the same value for the entire lifespan of the dataset, regardless of data transfers. They also allow for parallel loading.

A surrogate key is an object identifier in lieu of a direct business key. In Data Vault 2.0, hash keys are now being used as surrogate keys. This paves the way for parallel data loading because hash keys eliminate dependencies between entities. Compared to traditional keys, they can perform lazy joins, are deterministic, can represent unstructured and multi-structured datasets, and have a single field primary key attribute.

Hash keys are important because they enable connections between heterogeneous data environments such as Hadoop and Oracle. They also allow big data system loads to scale properly because of the lack of dependencies between data sources.
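As a minimal sketch of how such hash keys might be derived during staging: the MD5 function is shown here (SHA-1 or SHA-256 are also commonly used), and staging.crm_orders with its columns is an assumed example table, not a prescribed name.

```sql
-- Illustrative hash-key derivation in a staging query (assumed table and columns).
-- Business keys are trimmed and upper-cased so the hash is deterministic across sources.
SELECT
    MD5(UPPER(TRIM(customer_id)))                                   AS customer_hk,
    MD5(UPPER(TRIM(order_id)))                                      AS order_hk,
    MD5(UPPER(TRIM(customer_id)) || '||' || UPPER(TRIM(order_id)))  AS customer_order_hk,
    customer_id                                                     AS customer_bk,
    order_id                                                        AS order_bk,
    CURRENT_TIMESTAMP                                               AS load_date,
    'crm_extract'                                                   AS record_source
FROM staging.crm_orders;
```

Because each key is computed directly from the business key, hubs, links, and satellites can be loaded in parallel without waiting for sequence numbers to be assigned.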

2. Links

Past usage of links included representing relations, events, or transactions, which meant link-to-link constructs were common. In DV 2.0, the updated best practice is to use links only for business relations. The community considers modeling a link as an event or transaction a deviation from the core pattern.

3. Reference Tables

The modeling of reference tables in DV 2.0 expands beyond hubs and satellites. In fact, 2NF tables may be used as reference tables depending on business needs. In DV 2.0, we see wider flexibility with regard to how reference tables are modeled.
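As a small illustration, a reference table is typically just a compact lookup kept outside the hub/link/satellite structure and referenced by its natural key; the table below is an assumed example.

```sql
-- Illustrative reference table: small, relatively static lookup data.
-- Satellites can reference country_code directly without a dedicated hub.
CREATE TABLE ref_country (
    country_code CHAR(2)      PRIMARY KEY,
    country_name VARCHAR(100) NOT NULL,
    region       VARCHAR(100)
);
```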

Data Vault Automation

Many data vault automation tools exist today. However, the common denominator across all these tools is that they are metadata-driven to ensure consistency and flexibility across the entire lifecycle of a data warehouse.

As a result, data vault automation is largely template-driven. Metadata will dictate the generated data vault-specific objects, tables, and load processes. This approach lessens the development time needed for ETL scripts. The following processes are included in automation:

- Entity Generation

Entities such as the hub, link, and satellite are standardized and can be loaded with templated SQL (see the sketch after this list).

- Lineage Analysis

Data lineage pertains to the traceable history of transformations enacted upon data over time. It describes a dataset’s origin, movement, characteristics, and quality. Data Vault automation tools include automatic data lineage analysis in order to derive metadata and produce templated scripts that generate elements of the data vault.

- Data Governance

Data governance is a collection of processes, policies, and standards that ensures an organization can use its data to achieve its goals. Normally, organizations have their own data governance strategies in place that standardize the usage of data. With Data Vault automation tools, however, data governance guidelines can be derived automatically from metadata.

- DDL, DML, and ETL Generation

DDL (data definition language), DML (data manipulation language), and ETL (extract, transform, load) are integral to data management in a data vault. DDL statements define database structures, while DML statements create, modify, or remove the data within them.

ETL defines how data is fetched from the data lake, transformed, and loaded into a data warehouse.
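To make this concrete, below is a minimal sketch of the kind of templated DML/ETL that automation tools generate from metadata: a hub load and a satellite load with change detection. The staging.crm_customers table and the hub/satellite names are the assumed examples used earlier in this article, not the output of any particular tool.

```sql
-- Illustrative templated loads (assumed names); the same patterns are regenerated
-- for every hub and satellite from metadata, with only the names changing.

-- Hub load: insert business keys that are not yet present in the hub.
INSERT INTO hub_customer (customer_hk, customer_bk, load_date, record_source)
SELECT DISTINCT stg.customer_hk, stg.customer_bk, stg.load_date, stg.record_source
FROM   staging.crm_customers stg
WHERE  NOT EXISTS (
    SELECT 1 FROM hub_customer h WHERE h.customer_hk = stg.customer_hk
);

-- Satellite load: insert a row only when the descriptive attributes have changed,
-- detected by comparing hash_diff against the most recent stored row per key.
INSERT INTO sat_customer_details
    (customer_hk, load_date, customer_name, email, hash_diff, record_source)
SELECT stg.customer_hk, stg.load_date, stg.customer_name, stg.email,
       stg.hash_diff, stg.record_source
FROM   staging.crm_customers stg
LEFT JOIN (
    SELECT customer_hk, hash_diff,
           ROW_NUMBER() OVER (PARTITION BY customer_hk ORDER BY load_date DESC) AS rn
    FROM   sat_customer_details
) cur
  ON  cur.customer_hk = stg.customer_hk
 AND  cur.rn = 1
WHERE  cur.customer_hk IS NULL          -- new business key
   OR  cur.hash_diff <> stg.hash_diff;  -- attributes changed since last load
```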

Case Study for Data Vault 2.0

A healthcare organization that connects health and life companies with members of the healthcare ecosystem through targeted marketing campaigns opted to use Data Vault 2.0 to replace its legacy data silos.

Over the years, the company had collected enormous amounts of data in order to create personalized campaigns for its clients. Every acquisition meant additional data silos, and it was challenging to integrate 12-year-old legacy data with fresh data to produce analytics and relevant business intelligence.

Additionally, their data warehouse was built on-premises with no documentation. ETL became far too cumbersome for the organization, which hampered the efficiency of business intelligence processes and, in turn, negatively impacted the business as a whole.

The solution was to automate the data vault lifecycle to accelerate time-to-value. The organization decided to migrate its data from an on-premises Microsoft SQL Server environment to a cloud data warehouse.

From there, the organization adopted the Data Vault 2.0 modeling method. In addition, they were able to automate the process from development to deployment. In a short amount of time, the organization was able to automatically generate entities, analyze data lineage, generate apt data governance policies, and generate DDL, DML, and ETL scripts.

Get Expert Help In Data Vault 2.0

Here at Codvo.ai, we want to make sure that business intelligence is meaningful. We believe that adopting the Data Vault 2.0 methodology is the agile way to model your data warehouse. We specialize in the Data Vault 2.0 modeling strategy and stay up to date with industry best practices in order to provide efficient, streamlined business intelligence processes.

If you are interested, contact us today at marketing@codvo.ai
