Data modeling is the process of creating a structure to store data. Database modeling creates a model of your database; it captures information about the data that must be stored for an application. Applications are tools that implement some sort of business logic, so if we are creating a model of the data required by some application, then data modeling must start with collecting business requirements.
The first step in data modeling is to understand the business area and functionality being developed. We are providing a service and delivering business value. More specifically, we are solving the data side of the business problem.
We may already be collecting customer data, analyzing it, and drawing insights to drive business decisions. But we need to know more about what the source, the context, and the usage of that data are.
Data modelers need to talk business. Here are some starter questions:
- What is the purpose of the application?
- What is the business domain?
- What concepts exist in this domain?
- What are the problems to be solved by this application?
- What are the issues in this domain?
The answers to these questions will allow you to create a high-level conceptual model of the data or entities to be stored.
If someone tells you that you need to create a data model for a financial application and if it requires information about personal bank accounts, balances, and financial goals, then this would be appropriate for a personal finance management (PFM) application. On the other hand, if this is a financial application for a company, you would need to model customers, suppliers, invoices, payments, accounts, balances, etc. Some of the concepts are the same, but many are not, and their usage may be different.
You need to understand what business domain is being addressed and learn about the associated concepts. You can only do this by talking to the business people.
As they say, "a picture is worth a thousand words." We generally work with entity-relationship diagrams (ERDs) in database modeling. If you are interested in getting started with a graphical, collaborative database design tool, follow this advice to get started using Vertabelo. If you're unsure whether you need an ER Diagram (ERD), think again.
- How will your data model(s) be used? How do they fit into the overall project and product development lifecycle?
- Who will maintain the data model?
When data modeling, everything seems obvious. You give meaningful names to the entities so that their purpose is obvious. When choosing names for tables and columns, make it clear what the usage will be. In addition, you should have a simple document that explains the design choices you make. You want enough documentation so that the database can be maintained in the future. Documentation is about communicating the design and making it understandable for the future.
In addition, a good database modeling tool allows the model to be created collaboratively. We need to be able to collaborate across the design team, as well as with the development team to update the logical and physical data models during software development.
If you skip documentation now, it will be more costly in the future to make changes, identify problems, track bugs, and understand the nature of the data.
You should document the definition of the tables, the columns, and the relationships so that programmers can access the information. The documentation can also describe expectations of the database structure. Without documentation, your thinking is too shortsighted.
Next Level of Detail
The next level of questioning concerns:
- What are you going to do with these concepts or entities stored in the database? For what business purposes are we collecting the data?
- What are we looking to gain from this information?
- How much data do we have?
By clarifying these points, we narrow down the concerns we must address. We need to gather information that we can use to create our model, so that we can understand the conceptual domain and how to apply it to our database design.
For example, if you are storing customers, what information do you need to store? Why are you storing customer addresses – will you be sending them invoices? The answers to this next level of questioning will impact your data model.
In addition, by talking to the business, you have the understanding needed when considering trade-offs between performance, flexibility, security, and other non-functional requirements that may affect the data model.
An important question related to the usage of the data is the purpose for which the data will be used. Is the application's performance more important, or is it more important to be able to analyze and report on the data?
To put it in technical terms, is the application online transaction processing (OLTP) oriented or more related to online analytical processing (OLAP)? The response to this question will indicate the level of normalization that we want to make in the model: normalized for OLTP and denormalized for OLAP. You should not denormalize unless you have a clear business reason to do so, as denormalization will lead to redundant data that is more difficult to maintain.
“CRUD” applications are for online transactional processing (OLTP), while analytics are online analytical processing (OLAP). The main difference between an analytical and transactional database is the nature of the application using the data. A transactional database is designed for an application in which the user is more interested in CRUD, i.e., creating, reading, updating, and deleting records. With an analytical database, the user is more interested in analysis, reporting, and forecasting. In OLAP, the main focus of the database itself is simply to store data for analysis, no user maintenance of data. Analytical databases have fewer inserts and very few updates; the main goal is to retrieve and analyze data as quickly as possible. Therefore, it is important to consider the nature of the data usage for which you are designing your database.
Taking it Global
In today’s globally connected world, we may need to concern ourselves with applications that are being accessed from around the world. We may need to support an international audience. When thinking about our data models, it is important to consider how globalized an approach we need. Are there any country-specific requirements such as date, phone, postal code, or formatting that we need to manage in the data model?
The fact is that designers are often self-centric: they create data models that only properly handle their local time zones, addresses, ZIP codes, etc. A self-centric approach has a big problem: the resulting model will only support local data. Learn to think globally, not just locally.
For example, choose your data types correctly when modeling to support data that may not be as you expect. If you have a field for ZIP code that only allows 5 numbers, that will not work for countries with postal codes that are shorter or longer, or have alphanumeric postal codes. Similarly, if you need to store amounts, you may need to deal with currencies and exchange rates. Remember that not all currencies have 2 decimal digits; some have more, and some have less. If you are working with phone numbers, be wary of including constraints, as data model modifications will be required to support non-local users.
If you design a data model for a local audience, it may not be able to accept address information from a different country. For example, let's say we have a five-digit, numeric-only American ZIP code field. Someone from the UK may try to enter their eight-digit alphanumeric postal code. It just will not work. A global audience may not understand or use the same terms, formats, and layouts a local audience does. Enough flexibility should be incorporated into the model to handle as many global eventualities as possible. Don’t limit your data model of addresses to one locality; leave enough room for different styles.
Generally, users are going to want information extracted from the database and not just access it on the screen. How many users do we have? Where are they located? How often do they connect to the website? How long do they stay on the website? What is their behavior? What type of reports, if any, need to be generated?
The answers to these questions will help you during the database design to determine what types of information need to be stored, and how it may need to be manipulated for ETL (extract-transform-load), analytics, and reporting.
If you are working with sensitive data, you may need the ability to have the users authorize changes to data, roll back changes (think Undo with Ctrl+Z), and audit changes that users are making. How will you store active data and proposed changes? How will you ensure that changes can be rolled forward when approved and rolled back as required?
- Is the data highly sensitive? Does it possibly require a second user to approve changes before they are active?
- Do we need to be able to roll changes forward and backward? Under what conditions?
- What information do we need to store when changes are approved?
In many cases, this capability may not be required. But this is often a requirement in financial services to avoid fraudulent activities, where an estimated 80% of the frauds are committed by employees rather than external sources. In addition to storing the proposed changes, we need to track who proposed each change and when, as well as who approved the change and when, so that the changes are auditable.
Which brings up another issue: when the data needs to be audited, what detail of information is required? Do you need to know the user, the timestamp, and the IP address of the change? Do you need to keep historical copies of the previous modifications and past versions (snapshots) of the data?
If you answer these questions before completing your model, you will avoid headaches later on having to redesign your tables to support auditability, ability to apply changes only after approval, and the ability to roll back changes.
I often see databases created without a proper data retention and archiving strategy, or without any thought about auditing the changes that are made to critical data. What is your archiving strategy? How long will the data be kept online and available in active database tables?
Some systems are built to keep data in the database “forever.” However, this is usually not a reasonable long-term data retention strategy. At some point, data should be archived.
I advocate that data retention and archiving need to be part of your design considerations. Will you have active and historical tables so that inserts of new rows in the active tables remain fast, while searches on historical data can be optimized?
This avoids having to add archiving into your database design at a later stage.
Avoid Issues in Data Modeling
These are key points to consider when creating data models. Pay attention to the business usage of your data, and your database will be better designed and more robust. Working with the business to get organized will not take an enormous amount of time, but it will have an enormous impact on the quality of your data model.