Over the years, working as a data modeler and database architect, I have noticed a number of rules that should be followed during data modeling and development. Here I describe some tips in the hope that they might help you. I have listed the tips in the order that they occur during the project lifecycle rather than listing them by importance or by how common they are.
1. Plan Ahead
Failing to plan is planning to fail.
One problem that I have seen is when data modeling occurs at the same time as software development. This is like building the foundation before completing the blueprints. In the past, planning seemed an obvious step before starting development. Development teams would not create databases without planning just like architects would not build buildings without blueprints.
In application development, it is critical to understand the nature of the data.
Planning is often ignored so that developers can just “start coding”. The project starts and when issues come up, there is no slack in the schedule to address them. Developers take shortcuts with the intent to fix them later but this rarely if ever happens.
Careful planning is how to ensure that you end up with a proper database that is not hacked together. If you don’t spend the time and effort upfront addressing the data required by the processes, you’ll pay for it later with a database that must be reworked, replaced or scrapped.
Even if planning isn’t always done, many data modelers still follow these guidelines. That’s not to say we can predict every design need in advance, but most modelers believe that it’s worth the effort to understand the data and its usage. You would not want a design for transaction processing when the need is analytic report creation.
Times have changed; Agile methodologies are more prevalent so database teams must rethink their approach to data modeling. In Agile, the Domain Model from Use Cases is used instead of Entity Relationship Diagrams. However, the need for planning has not diminished. We need to understand the data and what it’s supposed to do. In general, the first few Sprints must focus on data design.
So it’s not Agile that is the issue for database modelers, but rather individuals who do not grasp the nature of data. Some see database development as the same as application development. Database modeling and software development are different and need appropriate focus.
The database is the core of most software applications. You must take the time to analyze the requirements and how the data model will meet them. This decreases the chance that the development will lose course and direction.
The developers must understand the importance of data and its contribution to the development process. We live in the information age. Applications display and manipulate data. It is the information contained in the data that gives meaning to the application.
It is not possible to foresee every requirement nor every issue, but it is important to prepare for problems by careful planning.
2. Document Your Model
When making a data model, everything seems obvious. You name the objects so that their purpose is evident and everyone will understand the meaning just by reading the name. This may be true, but it isn’t as obvious as you might think. When choosing names for tables and columns, make it clear what the usage of each object will be. Over time, the meaning of objects will be unclear without documentation.
Using a naming convention is one step towards effective documentation. When you have to make changes in the future, you will appreciate any existing documentation. A short, simple document that describes the decisions you made and outlines the design will help explain the design choices made at the time.
You want enough documentation so that the database can be managed by a new administrator and they can understand the meaning without having to come back to you for explanation. If the data model and the environment are not documented, it is difficult to maintain or change it as requirements change.
To some extent, documentation has little to do with the data modeling. Documentation is about communicating the design and making it understandable in the future.
Documentation is often an afterthought. When schedules are short, documentation gets ignored. Yet, this is technical debt with a high cost. Cutting corners during the development cycle will accrue costs in the future for database changes, problem identification, tracking bugs and for understanding the data model and nature of the data.
As an example, data models often have an “ID” field as the primary key for a table, or as a portion of the name of a key. This might be a primary key like TransactionID on the Transaction table. If some tables use “Number” as part of the name of a key, then it is good to document why. Perhaps ReferenceNumber is used as the name of the primary key on the Message table because that is what the reference is called in the business area. For example, in financial services, financial messages typically include a reference number.
Document the definition of tables, columns and relationships so that programmers can access the information. The documentation must describe expectations of the database structure.
In the Vertabelo tool, I can immediately include comments on any item (tables, columns, references, alternate keys), which means that the documentation is stored with my model rather than in some extra document that must be maintained separately.
Poor or absent documentation is often due to shortsighted thinking, but do not ignore its importance. This is still an issue to be addressed.
3. Follow Conventions
Naming conventions might not appear important during the design. In reality, names provide the insight to understanding a model. They are an introduction and should be logical.
Inconsistent naming serves no purpose. It only frustrates developers who must access the data, administrators of the database, and modelers who must make changes in the future.
When “ID” is used for some artificial keys but other tables use a different naming convention (such as “Number”), developers, analysts, and DBAs may waste time trying to understand the exceptions. Weak naming conventions also lead to errors in development because the naming is not consistent.
Hand-in-hand with documentation, using a naming convention makes it easier for someone to understand the model in the future. Do not randomly switch between using “ID” (like CustomerID) and “Number” (like AccountNumber) as the keys for tables. Only make exceptions to the conventions when they are justified, and document what the exception is and why the convention is not respected.
The same applies to cryptic names like “XRT1” – is that the extended reference table? Your guess is as good as mine. I hope the designer knew why they chose such a cryptic name, but I doubt that the next person to access the database can guess the reason.
Naming conventions are a matter of personal choice. Make sure decisions are consistent and documented.
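To make this concrete, here is a minimal sketch of a consistent convention, using SQLite from Python. The table and column names are illustrative assumptions, not from any real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One convention throughout: the artificial key is always <TableName>ID,
# and a foreign key column reuses the referenced column's exact name.
conn.executescript("""
CREATE TABLE Customer (
    CustomerID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL
);
CREATE TABLE Account (
    AccountID  INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customer (CustomerID),
    -- Documented exception: AccountNumber is the business identifier,
    -- so it keeps its business name rather than following the ID convention.
    AccountNumber TEXT NOT NULL UNIQUE
);
""")
```

Anyone reading this schema can predict key names without consulting anything else; the one exception is justified and documented right where it occurs.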
If I have succeeded in convincing you to apply naming conventions in your database design, feel free to read my next article, which is entirely devoted to this subject.
4. Think Carefully About Keys
Keys often generate controversy: primary keys, foreign keys, and artificial keys. Tables need a primary key that identifies each row. The art is to decide which columns should be part of the primary key and what values to include.
For proper normalization, each table needs an identifying key, and uniqueness must be guaranteed. However, the natural key and the primary key do not have to be the same column: an artificial primary key is fine, as long as the table also has a natural key.
Some data modelers prefer an artificial key for uniqueness, while others prefer a natural key to ensure data integrity.
So, should we use a natural key as the primary key? One challenge arises if the natural key must be changed: if it consists of many columns, you may need to make changes in many places. The opposite challenge arises when an artificial key is the only key for a table.
As an example, you might have a table storing information about products. The table may be defined with an artificial key such as a sequence number, a short alphabetic product code, and the product description. If uniqueness is ensured only by the artificial key, there may end up being two rows with the same product code. Are these the same product entered twice? Perhaps a unique key on the product code is more appropriate.
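The product example above can be sketched in SQLite from Python: an artificial primary key for convenient joins, plus a unique constraint on the natural key so a duplicate cannot slip in. The names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Artificial primary key (ProductID) for joins, plus a UNIQUE constraint
# on the natural key (ProductCode) so the same product cannot be entered twice.
conn.execute("""
CREATE TABLE Product (
    ProductID   INTEGER PRIMARY KEY,
    ProductCode TEXT NOT NULL UNIQUE,
    Description TEXT NOT NULL
)
""")
conn.execute("INSERT INTO Product (ProductCode, Description) VALUES ('WID', 'Widget')")
try:
    # A second row with the same product code is rejected by the database.
    conn.execute("INSERT INTO Product (ProductCode, Description) VALUES ('WID', 'Widget, again')")
except sqlite3.IntegrityError:
    print("duplicate product code rejected")
```

With only the artificial key, both inserts would have succeeded and the duplicate question would have to be resolved later, in the data.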
5. Use Integrity Checks Carefully
To ensure data integrity, we need foreign keys and constraints. Be careful not to overuse or underuse these integrity checks.
Domain tables are effective for enforcing integrity. They work well when there are many values to be checked against, or when the set of valid values changes frequently.
One issue can be that developers decide that the application will check integrity. The issue here is that a central database might be accessed by many applications. Also, you generally want to protect the data where it is: in the database.
If the possible values are limited or fall in a range, then a check constraint may be preferable. Let’s say that messages are defined as either Incoming or Outgoing; in that case there is no need for a foreign key. But for something like valid currencies: while these may seem static, they actually change from time to time. Countries join a currency union and currencies change.
Applications should also perform integrity checks, but don’t rely only on the application for integrity checking. Defining integrity rules on the database ensures that those rules will never be violated. In this way, the data satisfies the defined integrity rules at all times.
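Both techniques can be sketched together in SQLite from Python: a check constraint for the small, stable message-direction domain, and a domain table with a foreign key for the larger, changing currency domain. The schema is a hypothetical illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
-- Small, stable set of values: a CHECK constraint is enough.
-- Larger or changing set (currencies): a domain table plus a foreign key.
CREATE TABLE Currency (
    CurrencyCode TEXT PRIMARY KEY
);
CREATE TABLE Message (
    MessageID    INTEGER PRIMARY KEY,
    Direction    TEXT NOT NULL CHECK (Direction IN ('I', 'O')),
    CurrencyCode TEXT NOT NULL REFERENCES Currency (CurrencyCode)
);
""")
conn.execute("INSERT INTO Currency (CurrencyCode) VALUES ('EUR')")
conn.execute("INSERT INTO Message (Direction, CurrencyCode) VALUES ('I', 'EUR')")
```

When a currency changes, you insert a row into the domain table; the check constraint, by contrast, never needs that flexibility for a fixed pair of values.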
6. Don’t Forget Indexes in Your Design
Some indexing design is useful during database modeling, even if indexes may change during actual deployment and usage. Of course, it is possible to have too many indexes, just like it is possible to have too few.
Indexing is an ongoing process. During design, you start the process on your model, where the work focuses on the primary keys and unique constraints.
Indexes are important when considering queries on the data. When modeling, you should consider how the data will be queried. Take care not to over-index. Indexing revolves around query optimization.
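A quick way to check that a planned index matches a planned query is SQLite's EXPLAIN QUERY PLAN, shown here from Python. The table and index names are made up for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Txn (TxnID INTEGER PRIMARY KEY, AccountID INTEGER, Amount REAL)")

# Index the column the application will actually filter on.
conn.execute("CREATE INDEX idx_txn_account ON Txn (AccountID)")

# EXPLAIN QUERY PLAN shows whether the optimizer would use the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Txn WHERE AccountID = ?", (42,)
).fetchone()
print(plan)  # the plan line should mention idx_txn_account
```

If the plan reports a full table scan instead, either the index or the query needs rethinking; running this check during modeling is far cheaper than discovering it in production.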
7. Avoid Common Lookup Tables
I have often seen a common lookup table for attribute pairs. Defining a single, generic domain table is perceived to simplify the design. This style of domain table makes an abstract definition for holding text. I have heard it called an “Allowed Value” or “Valid Values” table, but the term “MUCK” table was coined for this anti-pattern in 2006: Massively Unified Code-Key.
MUCK tables contain unrelated data.
For example, you could have a table that defines the domain, the entry, and a description. Thinking back to the message example above, two entries might be:
| Domain | Entry | Description |
| 1 | I | Incoming message received by the bank |
| 1 | O | Outgoing message sent by the bank |
Now add in entries for another domain:
| 2 | SSI | Standard Settlement Instructions |
This is just a mess. What does the table mean?
Just for fun, I modeled a simple example of a MUCK table (or OTLT, “One True Lookup Table” if you are a Tolkien fan) and included some comments. Please note that this is an anti-pattern and I am not recommending that you use it in your data model.
With MUCK tables, proper constraints cannot be defined, and the table may grow to many rows without any coherent meaning. When you must query another table just to understand the meaning of a field, the design is not ideal. These “anything goes” tables make integrity checks complex or even impossible. One reason they get created is that several tables within the database have a similar definition, so a single generic table looks convenient; but data integrity checks then become impossible, and the table’s size may become quite large.
Normalization should lead away from MUCK tables: a single table should represent a single domain, rather than a single (MUCK) table representing all the domains. Without MUCK tables, we can put foreign key constraints in place.
Use separate tables for domain objects rather than cramming them into a single table. This allows proper column types, constraints and relationships. An “Allowed Values” table is just muck and has no place in a data model.
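The fix can be sketched in SQLite from Python: instead of one generic “Allowed Values” table, each domain gets its own table with proper types, a real primary key, and enforceable foreign keys. The names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
-- Instead of one generic AllowedValue(Domain, Entry, Description) table,
-- each domain gets its own table with proper types and constraints.
CREATE TABLE MessageDirection (
    DirectionCode TEXT PRIMARY KEY,   -- 'I' or 'O'
    Description   TEXT NOT NULL
);
CREATE TABLE InstructionType (
    TypeCode    TEXT PRIMARY KEY,     -- e.g. 'SSI'
    Description TEXT NOT NULL
);
CREATE TABLE Message (
    MessageID     INTEGER PRIMARY KEY,
    DirectionCode TEXT NOT NULL REFERENCES MessageDirection (DirectionCode)
);
""")
conn.executemany(
    "INSERT INTO MessageDirection VALUES (?, ?)",
    [("I", "Incoming message received by the bank"),
     ("O", "Outgoing message sent by the bank")],
)
```

With the MUCK version, nothing stops a Message row from pointing at the “SSI” entry; with separate domain tables, the foreign key makes that mistake impossible.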
8. Define an Archiving Strategy
All too often, I have seen databases created without a proper strategy for data retention and archiving. How long will data be kept online, available in active database tables? Most systems are built to keep data in the database “forever”. For most systems, this is not a reasonable long-term data retention strategy; at some point, active data should be archived.
One approach that I advocate is to include data retention as part of your design considerations. Will you have active and historical tables so that inserts of new rows in the active tables remain fast, while searches on historical data can be optimized?
This avoids having to redesign archiving into your database on top of the original design.
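A minimal sketch of the active/historical split, in SQLite from Python. The table names and the retention cutoff are hypothetical; the point is that the archive move is a single transaction designed in from the start:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TxnActive  (TxnID INTEGER PRIMARY KEY, TxnDate TEXT NOT NULL, Amount REAL);
CREATE TABLE TxnHistory (TxnID INTEGER PRIMARY KEY, TxnDate TEXT NOT NULL, Amount REAL);
""")
conn.executemany("INSERT INTO TxnActive VALUES (?, ?, ?)",
                 [(1, "2020-01-15", 10.0), (2, "2024-06-01", 20.0)])

cutoff = "2023-01-01"  # hypothetical retention boundary
with conn:  # one transaction: copy old rows to history, then delete them
    conn.execute("INSERT INTO TxnHistory SELECT * FROM TxnActive WHERE TxnDate < ?", (cutoff,))
    conn.execute("DELETE FROM TxnActive WHERE TxnDate < ?", (cutoff,))
```

Because both tables exist from day one, inserts into the active table stay fast as it stays small, and historical queries can be indexed and optimized independently.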
9. Test Early, Test Often
To paraphrase Al Capone (or John Van Buren, son of the 8th U.S. President), “test early, test often”. In this way, you follow the path of Continuous Integration. Testing at an early development stage saves time and money.
Testing is always a challenge in the development plan. There is often a test phase at the end of an Agile Sprint and system testing at the end of development. Testing is generally the first thing to be squeezed when time gets short.
In testing the database, the goal should be to simulate a production environment: “A Day in the Life of the Database”. What volumes can be expected? What user interactions are likely? Are the boundary cases being handled?
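One of those boundary cases can be turned into an automated check that runs from the first Sprint onward. A minimal sketch in SQLite from Python, with an illustrative schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Message (
    MessageID INTEGER PRIMARY KEY,
    Direction TEXT NOT NULL CHECK (Direction IN ('I', 'O'))
)
""")

def rejects_unknown_direction():
    # Boundary case: a direction outside the allowed domain must fail.
    try:
        conn.execute("INSERT INTO Message (Direction) VALUES ('X')")
    except sqlite3.IntegrityError:
        return True
    return False

assert rejects_unknown_direction()
print("boundary case handled")
```

Checks like this belong in the Continuous Integration pipeline, so a schema change that silently drops a constraint is caught immediately rather than at the end-of-project test phase.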
So the testing plan and proper testing must be an integral part of the data modeling and database development.
These are the main issues that I have seen when working with data modelers and reviewing data models. By paying attention to these tips, your database will be better designed and more robust. Yet, the payback on some of these investments is not always obvious or visible. Plan, document, use standards, create keys, ensure integrity, perform indexing, avoid MUCK, develop strategies, and TEST!
None of these activities will take an enormous amount of time yet have an enormous impact on the quality of your data model.
What are your views about these tips?