When designing a database, early decisions can have a huge impact on the performance and storage requirements. These decisions can be difficult to change later, as most subsequent work will depend on the physical model. This article highlights some common design decisions, flaws, and misconceptions.
Creating a Primary Key: Uniqueidentifer or Integer
When defining a surrogate primary key for a table, two options are the most common: Integer and UniqueIdentifier (aka. Globally Unique Identifiers, or GUIDs).
-- Create a table with an auto-incrementing INTEGER key CREATE TABLE IntegerTable (MyIncrementingID INT IDENTITY PRIMARY KEY CLUSTERED)
-- Create a table with a uniqueidentifier key CREATE TABLE GUIDTable (MyGUID Uniqueidentifier DEFAULT NEWID())
Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic.
With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution.
UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”.
GUIDs are typically stored automatically during the insert process by specifying NEWID() or NEWSEQUENTIALID() as the default value for a table:
CREATE TABLE GUIDTable (MyGUID Uniqueidentifier DEFAULT NEWID(), CharValue CHAR(1)) INSERT GUIDTable (CharValue) VALUES ('A'), ('B'), ('C') GO SELECT * FROM GUIDTable
This allows the database engine to handle value generation. Note that there is no discernible pattern when using NEWID().
GUIDs have a staggering amount of potential values (2^122) - making BigInt look like a speck in comparison), although storage and performance would prevent that number from ever being reached in the real world.
The biggest concerns come from performance and storage. UniqueIdentifier is four times larger than a standard integer column, and is a poor choice for clustered indexes due to its inherently random nature. These issues extend to any tables with a foreign key that references a UniqueIdentifier; storage is increased and clustered indexes suffer.
However, UniqueIdentifier is the only correct selection in any replication-like scenario. If multiple sources may be generating data for the same table (as in SQL Server replication), a UniqueIdentifier is required to ensure uniqueness of the primary key. A standard auto-incrementing key could have potential collisions when the replication process takes place.
Tim Ford has a well-documented explanation of the fragmentation that occurs when applying a clustered index to a UniqueIdentifier column. While a more recent SQL function NEWSEQUENTIALID() mitigates the fragmentation issues, it does not address the storage concerns. This option can only be specified as the default in the table definition.
NewSequentialID() allows a more discernible pattern, improving performance for clustered indexes:
CREATE TABLE SequentialTest (TestGuid UniqueIdentifier DEFAULT NEWSEQUENTIALID(), X Char(1)) Insert SequentialTest (X) Values ('R'), ('Q'), ('K') Select * from SequentialTest
As a general rule, UniqueIdentifier should be used in any situation where replication is a future possibility. Otherwise, integer datatypes will suffice. When modeling single-location databases such as a centralized data warehouse, the difference in storage space over multiple tables and billions of rows can add up to huge savings.
Many third-party databases (including Microsoft Sharepoint) use UniqueIdentifier as the primary key in their tables. This is due to the need to cover all potential client scenarios, rather than an official endorsement of UniqueIdentifier over integer.
Storing Monetary Data
Database novices often choose FLOAT as the data type for their monetary data. This is wrong! FLOAT should never be used to store financial data. There is no debate over that. While FLOAT uses eight bytes and allows you to store a large range of data, the FLOAT type uses binary representation and binary arithmetic. It is thus intrinsically imprecise.
For storing monetary data you should use the fixed-point datatype MONEY. It consumes 8 bytes, but guarantees accuracy (albeit with a lower range of values). MONEY is accurate to four decimal places, and should be used in financial situations where that precision is required.
Numeric and decimal are synonymous. If greater precision and scale are needed, they can be specified with these datatypes – I.E. Numeric (15, 6) or Decimal (23, 10). Storage space will increase along with the range of potential values.
For situations where precision is needed but database storage and speed are truly at a premium, a less orthodox approach is to store financial data as BIGINT, allowing the application to make the appropriate calculations during retrieval time.
Storing financial data as Bigint:
CREATE TABLE FinancialExample (XID INT IDENTITY, MyFinancialValue BIGINT) GO -- Multiply by the Scale of the value INSERT FinancialExample (MyFinancialValue) VALUES ( 7546.9089 * 10000) Select * from FinancialExample -- Divide by the scale of the value Select CAST(MyFinancialValue As Money) / 10000 as OriginalValue From FinancialExample
Integer storage and retrieval is generally faster than money or decimal, but this approach requires additional logic to correctly store and pull the values and should only be considered in extremely intensive OLTP scenarios.
Storing Character Data
In SQL Server, the N datatypes (Nchar, NVarchar) store Unicode data. Unicode data types allow a much larger range of values to be stored, but require twice as much storage space as non-unicode (i.e., char, varchar) data types.
Unicode data types are generally used to store non-English characters (such as Japanese Kanji), as many of these are not supported by Char and Varchar.
In the following example, a variety of values were inserted into both non-Unicode and Unicode data types:
CREATE TABLE AllowableNonUnicode (Example Varchar(12) COLLATE Japanese_CI_AS_KS_WS) CREATE TABLE AllowableUnicode (Example Nchar(12) COLLATE Japanese_CI_AS_KS_WS) Insert AllowableNonUnicode (Example) VALUES (N'a'), (N'漢'), (N'Å'), (N'Ø'), (N'字') Insert AllowableUnicode (Example) VALUES (N'a'), (N'漢'), (N'Å'), (N'Ø'), (N'字') Select * from AllowableNonUnicode Select * from AllowableUnicode
At first glance, it looks like all values were stored correctly. But closer inspection reveals that several of the values were modified by SQL Server during the insert process. The Varchar field was unable to store the values accurately.
As a rule, Unicode data types should be used in any situation where non-English characters are possible. This especially applies where such an implicit conversion could cause problems within the database or calling application. With relatively cheap storage and the availability of NVarchar to save space, Unicode data types are preferred to prevent conversions and codepage incompatibilities.
For really large strings, you should use Nvarchar(MAX) and Varchar(MAX) types. NVarchar(MAX) and Varchar(MAX) were introduced in SQL Server 2005 as replacements for the now-deprecated and much-maligned TEXT and NTEXT data types. Many SQL operations and functions will not operate on TEXT and NTEXT, and these types should never be used in any new development.
NVarchar(MAX) and Varchar(MAX) are intended for storage of large strings, hence their classification as large-value data types. Although there is technically a limit (around 2GB of space for a single field), in practical use these datatypes can store a string of any length.
For these types, the first 8KB is stored in the data page; any additional data is stored in an overflow page. This overflow is transparent to the user.
Mixing datatypes, both when copying data inside SQL Server and between SQL Server and the client, can lead to implicit conversion (Brent Ozar has a long write-up here). This occurs when SQL Server has to translate data from one datatype to another, and can do so according to conversion rules. Despite not generating an error message, this causes an additional operation for each row processed, and – more importantly – can lead to loss of data precision.
The currently accepted recommendation is to favor Unicode (N* datatypes) where storage is not as critical, or users are capable of entering data without restriction.
If space is at a premium and data entry is known or controlled, standard character datatypes can be safely used.