Thursday, May 14, 2015

What are the reasons for introducing a surrogate key to a dimension table?

When designing a dimension table such as Product and Employee in a data warehouse, a surrogate key is introduced for implementing an unique column for identifying each row uniquely in the entity. Even though we usually see this with almost data warehouses, it is not a must to implement a surrogate key. If the key of the source table related to the entity can be used as the key column of the dimension table, there is no necessity to introduce another column. How do we decide? What are the factors for deciding whether we need a new surrogate key for the dimension table or we can use the source key as the key column in the dimension table?


Here are some factors that can be used for deciding whether you need a new surrogate key (or artificial key) for your dimension table;

  1. Key of the source is a wide string such as CUS01X5679. It is not recommended to use a string data type as the dimension key unless it is fixed and small.
  2. Key of the source is a GUID. This is a 16-bytes value and it is too lengthy for the dimension key. In addition to that, if GUIDs in the source are not sequentially generated, it is not good for the key at all.
  3. Dimension table has to be loaded from multiple data sources. If table has to be populated from multiple sources, there is a chance to get same key from two differences sources for same record. This make impossible to make the source key as the dimension key because two records cannot be added with the same key.
  4. Dimension table is a SCD type 2 dimension. If the dimension table is a Slowly Changing Type 2 dimension, it needs to maintain same record with multiple versions addressing changes and maintaining the history. If the sources key is used as the dimension key, it is not possible to make another version for the same record with the same key.
  5.  Source reuses the key for new items. Some systems allow to delete records from entities and insert new records with the same keys that were used with the deleted records. Since the old record is already taken to dimension table, it is not possible to insert the new record with the same key if source key is the dimension key.
If you do not see any of reasons mentioned above, no harm, you can use the source key as the dimension key.

No comments: