Natural Keys or Surrogate Keys

Should a table always have a surrogate key? Is it really best practice, or should we stick to natural keys? Let’s dig in to see if there’s one clear answer.

Quick Definitions

A natural key is made up of naturally occurring fields in the table.

A surrogate key is a meaningless value added to uniquely identify a row. Usually this is an identity or ‘autonumber’ column.

[For detailed examples please see A Key to Keys]

Do we really have to choose?

Every table has a natural key. If there isn’t one, then duplicate rows can occur, and the table is not in first normal form. It may not be obvious, but every table has one. Try identifying the natural key in this table:

ErrorLog table diagram

In this case, the natural key might be TimeStamp, or it may be {TimeStamp, Message}. It’s clunky, and it may not even be enforced as a primary key (remember a PK is not required for any normal form – See First Normal Form (1NF)) but the natural key is there.

So we really don’t have to choose. Every table has a natural key, and if you aren’t sure what it is you should try to identify it (worst case scenario, it will be a set of all columns). Our only question is, “Should we add a surrogate key?”

What can a surrogate key do for us?

Simplicity

Sometimes natural keys can be very large. In the tables below, two suppliers may use the same document number, or a single supplier my re-use a document number, so our candidate natural key is {Supplier, DocumentNumber, ShippingDate}.  Any table that references InvoiceHeader will have to inherit all three of these attributes, and subsequent children will have to inherit even more:

Natural Keys Cascade DIagram

The more child tables we create, the more cumbersome it can be to write queries or manipulate data.

Surrogate keys can simplify this model:

Surrogate Keys Diagram

Space savings

In the first model above, if we assume the worst case scenario (every byte of those varchars is used), and we have a million invoice lines, the size of InvoiceLine, just for the foreign key columns, would be 67 MB. If we use the integer key it would be just under 4 MB.

Surrogate keys can quickly add up to a lot of space savings.

Clustering

Natural keys can be a clustered index, but they are not always a good choice for clustering. They will be prone to page splits, which will slow down transaction processing. An auto-incrementing integer can be efficiently clustered, will not incur page splits, and makes for optimal joins to referencing tables.

If you use a non-integer data type (such as a guid) for a surrogate key you may have the same problems with clustering. I advise to always use a simple incrementing integer unless there is a very good reason not to.

You can cluster guids if you utilize the newsequentialid() function in SQL Server, but integers are still preferred.

Performance

We have seen that surrogate keys can save a lot of space. In our example above, we saved 63 MB on the InvoiceLine table. That’s 63 less MB that any query has to scan, join, filter, or compare to do whatever it needs to do.

Surrogate keys can reduce the number of fields to compare to perform a join. It takes more processor operations to compare multiple fields per row than it does to compare one. Consider the join between InvoiceHeader and InvoiceLine in the examples above. Which model do you think would take less IO, less memory, and less processing cycles?

The same is true with strings, but worse. Internally strings are long series of numbers (one per character), each of which must be compared. And is your collation case-insensitive? Your database will have to factor for collation settings in order to compare strings. On small tables the performance loss may be negligible, but on large tables the difference will become obvious.

Consistency

In the first version of the invoice tables, if these values are coming from batch files provided from an outside source, the key fields may be reliably consistent. But what if these value were hand-entered from a UI of some sort? If someone enters an entire invoice, then realizes they picked the wrong supplier, or there was a typo in DocumentNumber, then all three tables must be updated to correct the error. That update can be a tricky and expensive operation!

When should we use surrogates?

There’s not really a negative to surrogate keys (they will not affect how ‘normalized’ your table is), but I always lean towards what’s practical. Often surrogates are used only because they are considered best practice. This is fine, but I try to be deliberate with every column. Why is it there? What does it mean? How will it be used? If the surrogate key does not serve one of the purposes above, I would not add it unless the company standards demand it.

You’ll notice in my simplified example above, I did not add a surrogate to InvoiceLineFee. Why? Simply because there are no tables that reference it. It will not simplify any child tables, it will not save any space (in fact it will take more), it will not be used for joins and it does not offer a performance advantage. Unless a system needs to reference a single invoice line fee, without any corresponding information, it does not ‘need’ a surrogate key.

(Notice that the natural key for InvoiceLineFee {InvoiceLineId, FeeType} now actually contains the surrogate key from InvoiceLine. This is acceptable, and is still a ‘natural’ key.)

Your company might still want a surrogate added to InvoiceLineFee for the sake of consistency, and that’s not wrong, it just may not be necessary.

A case for natural keys

Surrogate keys can change! Consider the table below, where UPC is the natural key and was chosen as the PK:

Consistent natural key example

This is a consistent natural key. The meaning of the UPC may change as manufacturers reuse them over time, but we don’t have to worry about the UPC itself changing. If it was updated, it would be the equivalent of a delete and a new row.

We could add a surrogate key (and this may be good). But when someone needs to update an item’s information, instead of updating, they may choose to delete it and re-enter it. This might be a valid method for the user, depending on UI flow and the complexity of updating an item vs. ‘rebuilding’ it. In this case the UPC now has a different surrogate key.

A surrogate key might be good for referential integrity in the originating system, but in downstream systems we should not tie data to the surrogate key! Downstream systems should instead utilize the natural key so that their records do not lose context.

Other thoughts?

Did I miss anything? If you can think of another situation where a surrogate key is useful, or specifically harmful, please comment below!

2 comments

  1. (Assuming here that all surrogate keys are Auto-Increments)

    Business keys that are human-seen are important, for the same reason as the UPC, and most importantly tie-in with a future Data Warehouse or Data Vault, which will generate new surrogate keys.

    Tables that are a “concept” table, like customer, UPC, Product, Invoice – anything “unique” that a human sees – should exist.
    As a consultant, I’ve often seen companies using the Auto-Inc PK as the business key, which is a mistake IMHO.
    – A business key is unique to the table and human-seen. (it might be more than one field)
    – A surrogate key is for the DBAs & programmers to use

    Thus my opinion that ALL tables have a surrogate key, even when there is a unique business key. The UPC example is a great one.

    For example, surrogate keys are great for being able to control merges within an SSIS job. It is an indirect timestamp.

    Like

    1. “As a consultant, I’ve often seen companies using the Auto-Inc PK as the business key, which is a mistake IMHO.”

      I don’t think that’s an opinion, I think it’s a fact! Over and over again I’ve seen systems suffer difficult to diagnose bugs and abnormalities due to this. And as a matter of fact, I learned it the hard way myself. On a table for UPCs, actually. Let’s just say I didn’t come up with that example by chance. :)

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s