There are three main data warehouse models: Kimball, Inmon, and Data Vault. Each model has its own strengths and weaknesses, so the choice between them depends largely on business requirements.
A Data Warehouse is defined as a central repository for relevant business data. It is a type of data management system that is meant to support business intelligence (BI) by performing queries and analyses on large amounts of data such as application logs, transactional data, user data, and more.
As such, designers of data warehouses must weigh performance against business requirements. For instance, a business that ingests large streams of data at any given time needs fast writes, while a business that queries more than it inserts may prefer a model that enables faster reads.
Data warehouses have four key characteristics, namely: subject-oriented, integrated, time-variant, and non-volatile. These characteristics are the hallmarks of an efficient and business intelligence-ready data warehouse.
First, a data warehouse is subject-oriented, meaning it hosts data relevant to a particular theme such as sales, distribution, or marketing. Second, it is integrated, meaning data from several sources can be pooled together. Third, it is time-variant: data is partitioned according to time periods. Finally, it is non-volatile: data can only be inserted, not replaced or deleted.
Source: Geeksforgeeks.com
The Inmon approach involves an enterprise data warehouse that logically separates (normalizes) data into data marts. Data marts represent one particular area of focus and usually correspond to its own reporting tool. Examples of particular areas of focus include product details, business drivers, aspects, relationships, dependencies, and affiliations.
The enterprise data warehouse acts as the single source of truth to provide data consistency. The advantage of this model is that writing data is easy due to the normalized nature of the model. However, querying data is more difficult for the very same reason.
Source: Geeksforgeeks.com
The Kimball approach is also often referred to as the star schema. Here, the dimensional data warehouse is logically segmented into multiple interconnected data marts, built from the bottom up. A data mart in the Kimball model represents one subject area with several attributes, and common attributes connect data marts to each other.
Compared to the Inmon approach, the denormalized, bottom-up nature of the Kimball model makes data easy to query. However, the tradeoff lies in the fact that there is no single source of truth, meaning data inconsistencies may arise.
Source: Researchgate
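To make the star-schema read path concrete, here is a minimal sketch using SQLite; all table and column names are hypothetical, chosen only to illustrate how a single fact table joins to denormalized dimension tables.

```python
import sqlite3

# Illustrative star schema: one fact table joined to denormalized dimensions.
# Table and column names are invented for demonstration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL
);
""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(20240101, 2024, 1), (20240201, 2024, 2)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 20240101, 100.0), (2, 20240101, 50.0), (1, 20240201, 75.0)])

# A typical read: a short join path from the fact table to each dimension.
cur.execute("""
SELECT p.category, d.year, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key = f.date_key
GROUP BY p.category, d.year
""")
rows = cur.fetchall()
print(rows)  # [('Hardware', 2024, 225.0)]
```

The query never needs to traverse more than one join per dimension, which is exactly why the denormalized Kimball layout reads so quickly.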
The Data Vault Model (DVM) was created by Dan Linstedt with the tagline “Turning Big Data into Sensible Business Information”. It is more than a model. It is also an architectural approach designed for agile warehousing. It is meant to deliver an enterprise data warehouse (similar to the Inmon approach) whilst removing the drawbacks of the 3rd Normal Form (3NF) and dimensional modeling techniques.
Dan Linstedt defines the Data Vault as “a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business”. He defines its main features as the following:
The staging area: this layer collates data from multiple sources.
The data warehouse: this is considered the "core" layer. It contains untransformed data in entities such as hubs, links, and satellites, while the Business Data Warehouse contains the data resulting from the application of business logic.
The data marts: this layer is also known as the "presentation" layer. Data from the business data warehouse is exploited for reporting and mining by OLAP cubes. Additionally, this layer is deployed via dimensional modeling.
Entities
Source: DWGeek.com
As mentioned earlier, there are three entities in a data vault: the hub, the link, and the satellite.
The hub contains a list of unique business keys, each with its own surrogate key. These business keys represent core business concepts such as customer, vendor, sale, or product, and they are used to track and locate information within the business data warehouse.
The link establishes connections between hubs (business keys) and other links. Note that links do not hold descriptive data; they simply represent the relationships between the data contained in hubs.
The satellite holds temporal attributes that can change over time and simultaneously acts as a historical record. It is a time-dimensional table containing descriptive information that provides context to a hub's or link's business keys.
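As a rough illustration of how these three entities fit together, the sketch below creates a hub, link, and satellite in SQLite. The table layouts are simplified and the names are invented for demonstration, but they show the insert-only, history-preserving pattern described above.

```python
import hashlib
import sqlite3

def hk(*parts):
    # Hash key over business-key parts (illustrative; the hash choice varies).
    return hashlib.md5("|".join(parts).encode()).hexdigest()

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Hub: unique business keys only, no descriptive attributes.
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY, customer_bk TEXT UNIQUE,
    load_dts TEXT, record_source TEXT
);
-- Link: relationships between hubs; holds no descriptive data itself.
CREATE TABLE link_customer_order (
    link_hk TEXT PRIMARY KEY, customer_hk TEXT, order_hk TEXT,
    load_dts TEXT, record_source TEXT
);
-- Satellite: time-variant descriptive attributes keyed by hub + load date.
CREATE TABLE sat_customer (
    customer_hk TEXT, load_dts TEXT, name TEXT, city TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);
""")

cur.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
            (hk("C-1001"), "C-1001", "2024-01-01", "crm"))
# Two satellite rows preserve history: the city change is inserted, not updated.
cur.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
            (hk("C-1001"), "2024-01-01", "Alice", "Austin"))
cur.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
            (hk("C-1001"), "2024-06-01", "Alice", "Boston"))

cur.execute("SELECT COUNT(*) FROM sat_customer")
versions = cur.fetchone()[0]
print(versions)  # 2 historical versions, no destructive update
```

Because the satellite is append-only, every prior state of the customer remains queryable by load date, which is what gives the data vault its historical-tracking and auditing properties.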
Though all data models deliver relative to what is needed, business requirements have evolved considerably over the past decade. Data Vault aims to address these modern requirements. Here are the key benefits of using the Data Vault:
It goes beyond hubs, links, and satellites. It addresses people, process, technology concerns, architectural structure, and continuous improvement. In short, it is a methodology that encompasses a model.
There is a community surrounding Data Vault that strives for continuous improvement. Each year, thought leaders attend the worldwide Data Vault 2.0 conference to present new and innovative ways to adapt to emerging trends.
It does not take monumental effort to get a small production system up and running. There is visible value in very little time--a hallmark of agility. Additionally, its agile nature allows Data Vault to work with small to multi-petabyte datasets.
Meaningful insights can be derived when multiple data types from different sources are analyzed in conjunction with each other. Data Vault's integrated analytics makes it convenient to analyze multiple datasets together.
Data Vault is able to keep historical records of data. This means that tracing data over time is possible, providing auditability and accountability. This feature effectively allows the Data Vault to act as a source of truth whilst allowing constant data changes.
Data Vault is not particular in terms of data. It can support any kind of data whether it be structured, semi-structured, or unstructured.
Automation reduces operational expenses. Data Vault allows ETL code to be reduced to a set of repeatable SQL queries, thereby reducing development expenses.
The data stored in the data vault is viewable in real-time data feeds for stakeholder perusal.
Data Vault allows for experimentation because any unwanted data/components can be easily rolled back to the last working version. Developers and designers need not worry about effecting lasting changes.
Data Vault 2.0 is the newest version of DVM that implements modern best practices. Here are the improvements that DVM 2.0 has made over DVM 1.0:
Three kinds of primary keys can be used: sequence numbers, hash keys, or business keys. Sequence numbers are not ideal because of their upper limit, their lack of business meaning, and load dependencies that negatively impact performance.
Business keys, on the other hand, map directly to business concepts (unlike sequence numbers). Additionally, they can be considered master keys because they hold the same value for the entire lifespan of the dataset, regardless of data transfers. They also allow for parallel loading.
A surrogate key is an object identifier used in lieu of a direct business key. In Data Vault 2.0, hash keys are used as surrogate keys. This paves the way for parallel data loading because hash keys eliminate dependencies between entities. Compared to traditional keys, they enable lazy joins, are deterministic, can represent unstructured and multi-structured datasets, and fit in a single-field primary key attribute.
Hash keys are important because they allow connections between heterogeneous data environments such as Hadoop and Oracle. They also allow big data system loads to scale properly because of the lack of dependencies between data sources.
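A minimal sketch of the idea, assuming MD5 over a normalized business key (actual implementations may choose a different hash function or normalization rules):

```python
import hashlib

def hash_key(*business_key_parts, sep="|"):
    """Deterministic surrogate key from a business key (illustrative DV 2.0 style)."""
    # Normalize so the same logical key always hashes identically.
    normalized = sep.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always yields the same hash key, on any platform,
# so a hub load in Oracle and a satellite load in Hadoop can run in parallel
# without waiting on a central sequence generator to hand out IDs.
k1 = hash_key("C-1001")
k2 = hash_key(" c-1001 ")  # normalization makes the result deterministic
assert k1 == k2
print(k1)
```

Because the key is computed rather than looked up, each loading process can derive it independently, which is precisely what removes the inter-entity load dependencies described above.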
Past usage of links included representing relations, events, or transactions, which meant link-to-link constructs were common. In DV 2.0, the updated best practice is to use links only for business relations. The community believes that treating a link as an event or transaction deviates from the core pattern.
The modeling of reference tables in DV 2.0 expands beyond hubs and satellites. In fact, 2NF tables may be used as reference tables depending on business needs. In DV 2.0, we see wider flexibility in how reference tables are modeled.
Many data vault automation tools exist today. However, the common denominator across all these tools is that they are metadata-driven to ensure consistency and flexibility across the entire lifecycle of a data warehouse.
As a result, data vault automation is largely template-driven. Metadata will dictate the generated data vault-specific objects, tables, and load processes. This approach lessens the development time needed for ETL scripts. The following processes are included in automation:
Entities such as the hub, link, and satellite are standardized and can be loaded with templated SQL.
Data lineage pertains to the traceable history of transformations enacted upon data over time. It describes a dataset’s origin, movement, characteristics, and quality. Data Vault automation tools include automatic data lineage analysis in order to derive metadata and result in a templated script that will generate elements of the data vault.
Data governance is a collection of processes, policies, and standards that makes sure that any organization that utilizes big data will be able to use said data to achieve its goals. Normally, organizations have their own data governance strategies in place that standardize the usage of data. However, with Data Vault automation tools, data governance strategy guidelines can be automatically derived from metadata.
DDL (data definition language), DML (data manipulation language), and ETL (extract, transform, load) scripts are integral to data management in a data vault. DDL statements define database structures, while DML statements create, modify, or remove data.
ETL defines how data is fetched from the data lake, transformed, and loaded into a data warehouse.
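As a rough sketch of the metadata-driven, template-driven automation described above, the snippet below renders a hub-load statement from a hypothetical metadata record. Real automation tools derive this metadata automatically (for example, from data lineage analysis) and support many more load patterns.

```python
from string import Template

# Hypothetical metadata record describing one hub; in practice, automation
# tools derive records like this from source-system scans and lineage analysis.
hub_meta = {
    "hub": "hub_customer",
    "business_key": "customer_bk",
    "hash_key": "customer_hk",
    "staging_table": "stg_crm_customer",
    "record_source": "crm",
}

# One template serves every hub in the warehouse: only the metadata changes.
HUB_LOAD = Template("""\
INSERT INTO $hub ($hash_key, $business_key, load_dts, record_source)
SELECT s.$hash_key, s.$business_key, CURRENT_TIMESTAMP, '$record_source'
FROM $staging_table s
LEFT JOIN $hub h ON h.$hash_key = s.$hash_key
WHERE h.$hash_key IS NULL;""")

sql = HUB_LOAD.substitute(hub_meta)
print(sql)
```

Generating hundreds of such load scripts from one vetted template is what lets automation tools collapse ETL development down to maintaining metadata.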
A healthcare organization that connects health and life companies with members of the healthcare ecosystem through targeted marketing campaigns opted to use Data Vault 2.0 to replace its legacy data silos.
Over the years, the company had collected enormous amounts of data in order to create personalized campaigns for its clients, and every acquisition meant additional data silos. It was challenging to integrate 12-year-old legacy data with fresh data to produce analytics and relevant business intelligence.
Additionally, their data warehouse was built on-premises with no documentation. ETL became far too cumbersome for the organization and as such hampered the efficiency of business intelligence processes which in turn negatively impacted the business as a whole.
The solution was to automate the data vault lifecycle to accelerate time-to-value. The organization decided to migrate its data from an on-premises Microsoft SQL Server environment to a cloud data warehouse.
From there, the organization adopted the Data Vault 2.0 modeling method. In addition, they were able to automate the process from development to deployment. In a short amount of time, the organization was able to automatically generate entities, analyze data lineage, generate appropriate data governance policies, and generate DDL, DML, and ETL scripts.
Here at Codvo.ai, we want to make sure that business intelligence is meaningful. We believe that adopting the Data Vault 2.0 methodology is the agile way to model your data warehouse. Hence, we are experts in the Data Vault 2.0 modeling strategy where we make sure to stay up to date with industry best practices in order to provide efficient and streamlined business intelligence processes.
If you are interested, contact us today at marketing@codvo.ai