Is Data Modeling dead?

Do we still need data models? What is the value of models? Why did data modeling fail in the past? How can we create data models? How can we create value through data modeling? I try to answer all of this in the following presentation.

Is Data Modeling dead?

The interesting question is: Is data modeling dead? And this is not a hypothetical question. It started some years ago when I was at a conference in the US, and some very famous data modelers, who had been working for big insurance companies for 30-40 years, asked: ‘Is data modeling dead?’ They were doing their job, creating value, but they didn’t feel appreciated anymore. The value of data modeling wasn’t seen as much as before.

So let’s start by understanding what data modeling is and why we need it. For context, I work on a data warehouse automation tool called Data Vault Builder. We believe in working on business models and automatically converting them into working code. This is my personal perspective.

Now, let me share my professional experience and a simplified history of data modeling. I began working in data warehousing in the year 2000. Initially, I worked for a financial institution where we focused on data reconciliation. Later, I moved into telecommunications and led a data management team for a telecom provider, where we dealt with more complex tasks.

In 2012, I faced challenges with agile data warehousing and near real-time data warehousing. I tried using Data Vault as a modeling paradigm, and it worked well. However, we realized that automation was necessary for it to be effective. This realization in 2012 laid the foundation for Data Vault Builder. Since then, I have been with the company, and we have been developing this tool as a software vendor.

Today, my role involves pre and post-sales activities, as well as conducting presentations to train and enable people to exchange ideas within the community.

What are data models?

So what do models mean to me? They provide meaning to complex things by simplifying them. A model is usually a simplified representation of something more complicated. If you look up “ontology” on Wikipedia, you will find different kinds of elements. We have classes or concepts, properties, relationships, axioms, and constraints. These basic elements of data modeling have remained consistent throughout history, although there may be variations. The core ideas remain the same. This means that if you learn data modeling at one point in time, you will be well-prepared for the future, for many years to come.

Let’s take an example of a tree. A real tree with all its leaves and branches has thousands of different aspects. Trying to capture a real tree in a database or a file would be enormous, and often unnecessary for analytical purposes or in a business context. Instead, we can create a simpler model that represents key characteristics. For example, we can use a model with green parts, brown parts, and the shape of a tree. Even with this simplified representation, if I show you the picture, you will understand that it represents a tree. The simplest representation would be just the outline, and yet you would still recognize it as a tree. Counting the number of trees would be straightforward. This level of simplicity is sufficient for many applications.


There is a great picture that illustrates the relationship between the real thing and its model. 

It says, ‘This is not a pipe’; it’s a picture of a pipe. It’s a simplified, two-dimensional representation, but you can still recognize what it is. However, it’s important to remember that it’s not the real thing. We can use these kinds of models to simplify and explain things.

Now, let’s consider the forest in this picture. If I show you this picture, you might say it’s a nice forest, but what is it all about? 


 

Here, we can simplify the concept. 


We can say that within a forest, there are many trees. A forester takes care of the forest, or perhaps multiple forests. Different species of trees exist, and there are owners. By showing you this simple picture, you can already understand the basic concept.

It appears that somebody likely wants to profit from the trees, but they need to take care of the trees so that they can grow and be used later. This becomes a means of communication. Different people from various business areas within the same company can come together and discuss, ensuring they have the same understanding. They can define their perspectives, make sense of things, collaborate, and foster important collaboration between IT and business people, who often have different viewpoints. This layer allows them to find common ground and say, ‘Okay, we understand each other well enough to work together.’

So, where does all this come from? Some people claim that Edgar F. Codd is the father of relational databases. He wrote an influential paper about the relational model in 1970. When you look at it, you realize that what came later is not something new: all subsequent approaches to data modeling rely on the same principles, because these principles remain consistent. I will emphasize this repeatedly because it’s crucial: data modeling is about identifying and describing things, and establishing relationships between them. Additionally, this foundation involves ACID (Atomicity, Consistency, Isolation, Durability), which defines how transactions are handled in databases, and normalization.

Before we move on to normalization: I tried to find a fitting picture for ACID, but the only thing I found was a happy face 🙂 I’m not sure what that was about.


Normalization

However, normalization is the aspect we’re interested in. It involves dividing data into primary keys for identification, foreign keys for establishing relationships, and attributes for describing things. One of the primary goals of normalization is to eliminate data duplication. For example, instead of repeating the full description of a customer type for every customer, we can create a separate table for customer types and refer to it through a relationship. This approach saves hardware costs, as storage, memory, and CPU were very expensive in the past. Although it may sacrifice some maintainability, reducing hardware costs was crucial at the time. Therefore, it’s important to remember the following principle: every attribute should provide information about the key, the whole key, and nothing but the key. (So, help me Codd.)
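To make this concrete, here is a minimal sketch in Python with SQLite; the customer and customer-type tables are invented for illustration and are not taken from the talk.

```python
import sqlite3

# In-memory database just for illustration.
con = sqlite3.connect(":memory:")

# Denormalized: the customer-type description is repeated on every customer row.
con.execute("""
    CREATE TABLE customer_flat (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        type_code   TEXT,
        type_desc   TEXT   -- duplicated for every customer of the same type
    )
""")

# Normalized: the description is stored once and referenced via a foreign key,
# so every attribute depends on the key, the whole key, and nothing but the key.
con.execute("""
    CREATE TABLE customer_type (
        type_code TEXT PRIMARY KEY,
        type_desc TEXT
    )
""")
con.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        type_code   TEXT REFERENCES customer_type (type_code)
    )
""")
```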

So we should understand what Codd was about – online transaction processing systems (OLTP). These systems are focused on managing business processes. However, later on, new systems emerged that aimed to provide better views and reporting capabilities. This introduces a different aspect because now we are querying massive amounts of data, not just single records. Speed is crucial, as data warehousing systems span a long time period, and we want to track historical changes. 

For example, when customers change their names or when leads become customers, we need to observe how things change over time, such as customer lifetime value. In online transaction systems, surrogate keys or technical keys are often used for simplicity in handling changes. These keys are not stable over time, but it doesn’t matter since they are relevant to the current moment. 

However, in the data warehousing world, we prefer to transition slowly to using business keys to identify entities, ensuring stability over time, even if IT systems are replaced. But this approach has some downsides. For instance, if I were to change my name after getting married, as happens in certain countries, using just my name and birth date as the key to identify me would make me disappear and reappear as a new entity. Models are never perfect, and we always need to consider what we want to trace and how we can identify things. The world is imperfect, and we may miss certain changes, such as a name change, and mistakenly treat the person as a new customer.
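As a small, hypothetical illustration of that pitfall (the names and the way the key is composed are made up), deriving the identifying key from mutable attributes breaks identity tracking as soon as one of them changes:

```python
from hashlib import sha1

def business_key(name: str, birth_date: str) -> str:
    """Illustrative business key composed of name and birth date."""
    return sha1(f"{name.upper()}|{birth_date}".encode()).hexdigest()

key_before = business_key("Anna Meier", "1985-04-12")
key_after = business_key("Anna Huber", "1985-04-12")  # same person after a name change

# The same person now looks like a brand-new entity in the warehouse.
print(key_before == key_after)  # False
```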

The 3NF Data Warehouse

Now, let’s shift our focus to the analytical world. In 1992, Bill Inmon, often referred to as the father of data warehousing, came into the picture. He began building the data warehouse and defined it as a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management decisions. To be honest, this definition still holds true for many cases. 

If we look at approaches like data lakes, they can fulfill these requirements if implemented correctly. However, some data lakes fall short in replacing a data warehouse entirely. They may replace certain functions but require additional means to achieve the full functionality. It’s not impossible, just something to be aware of. 

Only when you address all these aspects, such as data integration, nonvolatility, and time-variant capabilities, do you have the complete picture. Inmon used third normal form databases, which were improving at the time, to relate data to each other. In addition to primary and foreign keys and attributes, he introduced surrogate keys. This was done to save different versions of information related to a specific primary key. For example, if I change my name and the company is aware of the change, different versions of my information could be stored, indicating validity from a specific point in time. The challenge with this approach is maintainability. 


Foreign keys are used, and in a fully historized data warehouse with multiple versions of information, any change to the customer entity, including adding new columns, can impact the entire history of related entities. This cascading effect through the data model makes changes extremely costly. Initially, when starting with an enterprise data warehouse and implementing a third normal form structure, it can be straightforward and simple. However, as changes occur and data sources evolve, it can become a nightmare. Many data warehouse projects start strong but gradually become slower and slower until they eventually die and are replaced by new solutions. This led to the emergence of approaches, particularly advocated by Ralph Kimball, that aimed to simplify the process.
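Before moving on, here is a minimal sketch of the versioning pattern just described, with an invented schema: each change creates a new customer version with its own surrogate key, and every table that references the customer by surrogate key is pinned to one specific version, which is where the cascading cost comes from.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One row per version of a customer, identified by a surrogate key and a
# validity interval (illustrative 3NF-warehouse style).
con.execute("""
    CREATE TABLE customer_version (
        customer_sid INTEGER PRIMARY KEY,  -- surrogate key, one per version
        customer_bk  TEXT NOT NULL,        -- stable business key
        name         TEXT,
        valid_from   TEXT NOT NULL,
        valid_to     TEXT                  -- NULL means "current version"
    )
""")

# Related entities reference a specific version. Rebuilding the customer
# history (for example after adding a column) means rewriting these
# references too, which is the cascading maintenance cost described above.
con.execute("""
    CREATE TABLE contract (
        contract_id  INTEGER PRIMARY KEY,
        customer_sid INTEGER REFERENCES customer_version (customer_sid),
        start_date   TEXT
    )
""")
```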

The Dimensional Model

We should consider the purpose of what Kimball called conformed dimensions. He integrated all the data from different IT systems into one customer dimension, denormalizing the data, since storage had become much cheaper over time. Denormalizing the information makes it simpler to query, with fewer joins.

Another approach Kimball took was building lean fact tables that reference different dimensions. However, maintainability remains a concern, especially when dealing with historized dimensions. Many people at that time decided to historize everything without assessing the business value. Adding new columns then requires recreating the dimension history and rewriting the fact table. Still, because facts and dimensions are denormalized objects, the impact is limited to those tables, which keeps it smaller.

 

It was initially simpler to implement, and many believed it could be a way forward, especially when dealing with only two or three source systems. However, adding a fourth or fifth system became exponentially more expensive. 

Another contribution attributed to Kimball is that he made querying data very simple. With the many-to-one relation from the fact table to different dimensions, you can filter on any two or more dimensions and obtain a meaningful result without encountering fan traps or data multiplication. His approach also helps identify incompatible hierarchies. Kimball’s presentation style still holds true today, as many reporting tools are optimized for this kind of data.
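Here is a small sketch of such a star schema and a query filtering on two dimensions; the table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Minimal star schema: a lean fact table with many-to-one references to
# denormalized dimensions.
con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
    CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        date_key     INTEGER REFERENCES dim_date (date_key),
        amount       REAL
    );
""")

# Because every join is many-to-one from the fact table, combining two
# dimensions does not multiply the measure (no fan trap).
rows = con.execute("""
    SELECT d.year, c.segment, SUM(f.amount) AS total_amount
    FROM   fact_sales f
    JOIN   dim_customer c ON c.customer_key = f.customer_key
    JOIN   dim_date     d ON d.date_key     = f.date_key
    GROUP  BY d.year, c.segment
""").fetchall()
print(rows)
```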

While he didn’t solve the maintainability issue, he simplified querying. This is where Data Vault comes in and suggests that we should go back to the normalized core and improve it instead of replacing it. For presenting data, the dimensional model is still appropriate, but it should be created in a virtual or rebuildable way based on the normalized core. 

Data Vault Modeling

We return to the concept of defining and identifying things, where a single object called the hub stores only the immutable key. The attributes, with their history, are stored in separate database tables (satellites), and relations are stored in their own relation tables (links). By separating the functions of the data into distinct objects, Dan Linstedt, the inventor of Data Vault, achieved benefits in terms of maintainability.
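A minimal sketch of that separation, using invented table shapes (real Data Vault implementations typically carry additional metadata columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")

con.executescript("""
    -- Hub: only the immutable business key plus minimal load metadata.
    CREATE TABLE hub_customer (
        customer_hk   TEXT PRIMARY KEY,   -- key derived from the business key
        customer_bk   TEXT NOT NULL,      -- immutable business key
        load_date     TEXT,
        record_source TEXT
    );

    -- Satellite: the descriptive attributes, historized per load date.
    CREATE TABLE sat_customer_details (
        customer_hk TEXT REFERENCES hub_customer (customer_hk),
        load_date   TEXT,
        name        TEXT,
        address     TEXT,
        PRIMARY KEY (customer_hk, load_date)
    );

    CREATE TABLE hub_contract (
        contract_hk   TEXT PRIMARY KEY,
        contract_bk   TEXT NOT NULL,
        load_date     TEXT,
        record_source TEXT
    );

    -- Link: the relationship between two hubs, stored in its own table.
    CREATE TABLE link_customer_contract (
        link_hk     TEXT PRIMARY KEY,
        customer_hk TEXT REFERENCES hub_customer (customer_hk),
        contract_hk TEXT REFERENCES hub_contract (contract_hk),
        load_date   TEXT
    );
""")
```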


This approach was made possible around the year 2000 when joining tables, computing power, and memory became more affordable. 

While it may not be as physically efficient as a star or snowflake model, it offers advantages in terms of maintainability. However, for the most performance-critical systems, modeling this purely as a data vault may not be the best choice. 

However, for master data and other areas where data volume is typically not extensive, even in larger companies, the Data Vault model works well, and it’s worth investing slightly more in hardware and compute resources to save effort in implementation and maintenance. This approach involves splitting the data into smaller components. The idea itself is not entirely new, as ontologies already defined this separation; what is novel is physically storing the data this way.

The evolution from the third normal form to the data vault model still relies on the dimensional model for querying, which can be a virtual view since it’s less normalized than the data vault model.

It’s pretty straightforward, and you can automate it just by defining what you want to achieve. To do this, we need a staging layer, which has remained unchanged since the beginning. The data from the sources needs to be copied and mapped to the Data Vault core, and this process can also be automated using the business key definition and denormalization. 
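To illustrate the virtual presentation layer, here is a sketch with invented names in which the customer dimension is just a view over a hub and its satellite, so it can be dropped and rebuilt at any time without touching the stored history:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Minimal Data Vault core to project from.
con.executescript("""
    CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_bk TEXT NOT NULL);
    CREATE TABLE sat_customer_details (
        customer_hk TEXT REFERENCES hub_customer (customer_hk),
        load_date   TEXT,
        name        TEXT,
        PRIMARY KEY (customer_hk, load_date)
    );
""")

# The dimension is a virtual view: it selects the latest satellite version
# per hub key and can be regenerated whenever the model changes.
con.execute("""
    CREATE VIEW dim_customer AS
    SELECT h.customer_bk, s.name
    FROM   hub_customer h
    JOIN   sat_customer_details s
           ON  s.customer_hk = h.customer_hk
           AND s.load_date = (SELECT MAX(load_date)
                              FROM   sat_customer_details
                              WHERE  customer_hk = h.customer_hk)
""")
```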

Other approaches to modeling

Apart from these approaches, if you want to learn more about Anchor Modeling, I recommend reading the old articles by Lars Rönnbäck, a clever individual with many innovative ideas. He addressed some problems that are not well automated in Data Vault modeling, and his approach involving attributes and single-table structures is quite intriguing.

It’s a completely different approach to data modeling, taking inspiration from FCO-IM (Fully Communication Oriented Information Modeling). The idea behind FCO-IM is that business users, who are typically not trained in data modeling, can express true statements about their business, such as ‘a student lives in a city.’ From these types of sentences, data models can be reverse-engineered, and data can be extracted, including things, relations, and attributes. There are even tools available to automate this process, which is quite fascinating. If you’re interested, I recommend giving it a read.

Additionally, there’s the BEAM approach, which I have less experience with, but I can refer you to the book as many colleagues of mine have recommended it. 

We also have ensemble logical modeling, which involves structured workshops to extract knowledge about the business from users and break it down into events, people, places, and things, assigning attributes and building a model from there. 

Again, the emphasis is on a conceptual model first: identifying things, then adding relations and attributes. If you understand these fundamental principles, you’ll be well-equipped to handle various data modeling approaches.

Is data modeling dead?

Now, let’s move on to the final question: Is data modeling dead, considering the history I’ve shown you? I believe that creating data models solely on paper (even if you use software), without any automated connection to the real-world implementation, and without demonstrating the value directly to business users so they can use and benefit from it, is not an effective approach. If you don’t generate the code automatically, developers will struggle to understand the value of the data models. Therefore, this approach is unlikely to succeed.

Another problem is the time it takes to create data models before starting implementation, which initially doesn’t create any value. This is not a viable way forward. Instead, we need to adopt agile data modeling, linking conceptual, logical, and physical implementations of the models in an automated manner. This approach brings benefits to both business users and developers, and it is the right direction to follow.

Conceptual, Logical and Physical

So, in the next few minutes, let’s discuss how we can link data models to real-world implementations. But before we delve into that, let’s briefly define what conceptual, logical, and physical data models are, although the boundaries between these layers may vary. Conceptual models are about capturing the concepts, classes, and entities based on your modeling paradigm, encompassing things and places and the actions that occur. 

It involves defining what a customer is or what a product represents. When we start data modeling, we often use tools that separate different types of models. We begin with the conceptual model, export it, import it into a logical model, then repeat the process to create a physical model. However, this approach creates synchronization issues since the models are not automatically aligned. That’s why I believe it’s crucial to maintain all the models in one place. 

So, for me, the conceptual model is the starting point of the logical model, and then we add the relations and key definitions. By doing this gradually, the conceptual model is converted into the logical model. However, I don’t consider them as two separate models or tools; rather, they should be part of a workflow. As we work in an agile manner, it’s possible that some parts of the model exist at the conceptual layer, while other parts are already at the logical layer. 

What’s important to me is that there is a direct relationship between them, not just from conceptual to logical, but also to the physical implementations, such as creating tables, columns, and indexes in the database. In my opinion, this should happen automatically because in the past, this disconnect between the different layers led to the belief that data modeling was dead. 

The transition from conceptual to logical to physical models used to take months, making it difficult for business users to comprehend. The idea here is to automate this process and immediately produce output within the agile paradigm.

Another problem with the different layers was that the workflow sometimes worked in one direction only. If people made direct changes to the database due to production incidents, the models were not updated accordingly. In such scenarios, the models became unreliable, causing serious problems. Therefore, a two-way connection is necessary. If we want to revamp data models and make them effective, we need to move from conceptual to logical to physical in real-time. Additionally, if any changes are made to the physical implementation, the data model should be updated accordingly. 

I believe this is the key to making data models valuable, as they create business value and benefit the implementation side. 

CI/CD

To achieve this, we need to separate our development, test, and production environments. Otherwise, if we modify the model and directly change the production environment, it would be disastrous. We would lack control and versioning, and it would not comply with ITIL standards, which is typically our goal.

However, we still need a quick-fix process to address any issues in production, with the changes then merged back into the development environment. Usually, the answer to this is an ITIL-inspired process.

If we aim to become truly agile, we can work with Git and GitFlow, adopting a distributed development approach. This means developing different features or stories in isolated branches and deciding the deployment order freely. We have complete control, full versioning, and this approach was already defined by Java developers. 

Java developers use a working folder on their local machines, plus a local and a central Git repository. They create branches, check them out into their local repositories, and work on a specific feature in one branch in their working folder.

However, in the data warehousing world, we don’t have a direct equivalent. We work with databases and data, so we need to find a way to make it work. In my view, the equivalent of a working folder in the Java world is a sandbox. Therefore, we need to explore how to work with sandboxes, synchronize them with the integrated development environment, and integrate them into the GitFlow-based process. Others have reached the same point and offered different solutions.

Pipelines as Code

One alternative solution proposed by others is to use pipelines as code. When we have code, we can utilize the working folder approach, work with files, and follow a fully GitFlow-based process. They are correct in stating that you can branch your work, merge it after basic tests, perform automated testing, and handle pull requests. You can work on different branches and define the deployment order. This is highly beneficial for achieving continuous integration and continuous delivery (CI/CD). However, there is a major drawback to this approach, especially when considering DBT or DBT Vault. It lacks the modeling aspect, which gives meaning to the process.

So, instead of focusing on business requirements, technology takes precedence, and as a result, the entire meaning is lost. The modeling aspect is confined within the code, requiring the need for code reverse engineering. While developers may excel in this aspect, it poses a question: Can we achieve the same outcome without compromising the modeling aspect? How can we prepare data models for continuous integration and continuous deployment (CI/CD)?

Bringing the models back alive

The first step is to establish an agile data model, and this is where Data Vault comes into play. It aligns with the fundamental concept of incorporating new elements seamlessly, avoiding conflicts when merging different models. Starting small and making changes cost-effective is the key – and the changes do not become more expensive as the project progresses. The loose coupling of elements guarantees independence, allowing for deployment in different orders. We can begin data modeling at different points in the model and merge them later, ensuring a favorable outcome from a modeling perspective.

The second aspect is to automate the transformation of a modeling artifact into a physical artifact. Fortunately, this process is well-suited for automation since we have already separated the functions of data into different model elements (identifying, describing, relating). Consequently, we can transform a hub into its physical representation, such as a database table, in a deterministic way. 


Similarly, we can take a physical table and transform it into a logical object. To capture and preserve metadata effectively, we represent the data model as files that adhere to Git’s requirements. Each hub, link, and satellite corresponds to a single file. By making these files human-readable, we enable conflict resolution and enhance comprehension. Proper sorting and formatting of these files allow us to leverage branching and merging at the logical model level.
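As a hypothetical sketch of that idea (the file format and the generator below are invented and are not the Data Vault Builder’s actual format): each hub lives in one small, stably formatted file, and a deterministic function renders it into its physical table.

```python
import json

# One model element = one human-readable file. Stable sorting and indentation
# keep Git diffs small and merges conflict-free.
hub_customer_file = json.dumps(
    {"type": "hub", "name": "customer", "business_key": "customer_number"},
    indent=2,
    sort_keys=True,
)

def hub_to_ddl(definition: str) -> str:
    """Deterministically render a hub definition into its physical table."""
    hub = json.loads(definition)
    return (
        f"CREATE TABLE hub_{hub['name']} (\n"
        f"    {hub['name']}_hk TEXT PRIMARY KEY,\n"
        f"    {hub['business_key']} TEXT NOT NULL,\n"
        f"    load_date TEXT,\n"
        f"    record_source TEXT\n"
        f")"
    )

print(hub_to_ddl(hub_customer_file))
```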

Now, we have a similar approach to those who work with data pipelines as code, but with an additional layer of meaning. We have a means of communication and documentation that explains the relationships within the data model. We can effectively collaborate with business stakeholders, automatically export metadata to their reports, and showcase the value of their contributions to the data modeling efforts. This bridges the gap and benefits both developers and business professionals, breathing life into our data models.

This is the essence of the Data Vault methodology, and in just a few minutes, I can demonstrate how the Data Vault Builder facilitates a fully automated modeling-to-code process. I will show you how we can automate deployment and streamline operations, aligning with our belief that the data model should serve as the foundation, while everything else should be as automated as possible.

