Sunday, April 6, 2014

Reasons for adding a surrogate key for dimension tables

As a best practice, we always add a new key, known as a surrogate key to dimension tables for identifying the records uniquely. However, generally, the unique business key becomes the primary key of relational tables related to business entities but once the surrogate key is added to dimension tables, we make it as the primary key without reusing the business key. Here is an example from AdventureWorksDW data warehouse.

image

Surrogate key is an integer that is assigned sequentially when records are added. There are many other names for this such as artificial key, synthetic key, meaningless key, integer key or non-natural key. Do we really need this? Does it seem inappropriate using the business keys for primary keys? It may seem sensible for reusing, but here are some valid reasons for not using business keys but using surrogate keys;

  1. Dimension is populated from multiple sources
    In many cases, dimensions are populated from multiple sources. This always introduces incompatibility between business keys’ data types originating from sources. Not only that, the uniqueness of the business key is not always guaranteed too.
  2. Business key is reassigned when an object becomes obsolete
    OLTP systems always focus on current operations giving less importance to the history. If an object becomes inactive or obsolete (Example: Organization decides to discontinue a product), there is a high probability to get the business key used assigned to a new object. If the business key is used as the key of the dimension, this becomes an issue because dimension already holds an object with the same business key.
  3. Business key is a complex string or GUID
    Business keys used in source systems could be formed combining numbers and different characters or with GUIDs. In such scenario, there is no harm of using the business key as the primary key of the dimension table as long as the uniqueness is guaranteed. However, in terms of performance, it makes more effective to have an integer surrogate key than a complex string.
  4. Integer improves the performance
    As mentioned with reason 3, it is always better to have an integer key as the key of the dimension even though the business key is a simple string. This improves the efficiency of the table and speeds up the data retrieval when joining with facts.
  5. Objects without business keys
    When populating dimension records, there could be a situation that object needs to be populated without a business key. Good example on this is, holding a newly introduced product in the Product dimension, before extracting it from the source, for supporting a transaction exist in the fact table. Another example is, holding an anonymous customer in the Customer dimension for supporting promotion events recorded in the fact table. It is become possible for holding these objects in dimension tables as “Unknown”, “Not applicable”, or “Anonymous” when surrogate key is used.
  6. Objects are versioned in dimension tables
    In order to maintain the history in dimension tables, same objects are versioned by adding multiple records (These dimensions are called as Slowly Changing Dimension Type 2). Example: Customer is recorded with his country Sri Lanka. Later customer becomes a citizen of Australia. In OLTP systems, it is just a change for country column. But in data warehousing, for maintaining the history for events recorded (facts), a new version (new record) of the customer is added with Australia for the country. This is not possible if the business key of the customer is used as the key of the dimension.

Of course, the maintenance cost gets increased and a mechanism has to be introduced for generating surrogate keys and linking them with foreign keys in fact tables. But it is not as difficult as many think because there are many built-in facilities available in DBMS and ETL systems.

No comments: