What is a Surrogate Primary key? known also as Substitute, Synthetic or Generated Key - Logical Data Modeling -

Data System Architecture

About

A surrogate key is a substitute primary key for when:

  • the data entity are created in distributed way
  • you don't have access to a central entity such as database to create a simple sequence
  • you don't have a natural one (entity not found for instance)
  • the key is too long for the database (example: url for a web page)

Since the value has no meaning, there is unlikely to be any demand for a change making them true primary key.

Surrogate means to choose in place of another

It's a primary key that holds a generated value.

A surrogate key exists almost always next to a business key.

Example

  • An ID column populated via a database sequence is known as a surrogate key. - It works only when data is being inserted into only one database (not many databases at the same time)
  • A ObjectId - A random hash-based generated value for Json object
    • of a long natural key (such as an URL)
    • of all values of the record
SELECT * FROM tbl_name
  WHERE surrogate_pk=MD5(urlValue)
  AND url=urlValue;

Usage

No key

A surrogate key needs to be generated, if there are:

  • no logical unique and non-null column set of columns,
  • The key value may change.

Space

Numeric values also produce shorter values of predictable length. Storing numeric codes in place of string is more space-efficient.

Dimensional Modeling

The use of surrogate keys (SKs) in dimensional modeling is not only the simplicity of star design or performance.

You often need to populate dimensions with extra records not found in the source table, such as UNKNOWN or NOT APPLIES records

These records do not have any natural identifier.

Characteristic

Sortable by Time

If the id are sortable by time, a list entity could be sorted without fetching extra information.

Example:

Storage

IDs should ideally be short enough.

Database system are generally limited at 64 bits to store integer

Collision

collision probability should be acceptable

Implement

Surrogate key may be implemented via:

Database Sequence

A sequence (auto-incrementing number) in a database is suitable in a centralized architecture. If the id can be generated decentralized, you need to add more information to make them unique (ie the time and node name for instance as instagram does)

Server-side generated identifiers are pretty much guaranteed to be unique.

GUID

Via the standard guid outside a database (distributed)

Bootstrap

Bootstrap implements it as a completly random number 1)

const MAX_UID = 1000000
const getUID = prefix => {
  do {
    prefix += Math.floor(Math.random() * MAX_UID)
  } while (document.getElementById(prefix))

  return prefix
}

Generated / Object Id

Generated outside a database:

In datawarehouse, they may implements a not/so real surrogate key as a hash of the value of all columns.

Fingerprint

device fingerprinting is also a generated way to identify an individual.

NanoId

Nanoid is a random ID string comparable to UUID v4 (random-based)

  • URL friendly
  • 21 characters (vs 36 with UUID because of a bigger alphabet).
  • Polyglot: 14 languages

If the number of id generated is small, you can reduce the size: Collision probability by size (See also Collisions of Hash or Identifier Generation)

const { nanoid } = require('nanoid');
nanoid(15); //=> "YrxC0J7fBQobaL-"

Ulid

Universally Unique Lexicographically Sortable Identifier

https://github.com/ulid/spec

Canonical String Representation

ttttttttttrrrrrrrrrrrrrrrr

where

  • t is Timestamp (10 characters)
  • r is Randomness (16 characters)

K-Sortable Globally Unique IDs

https://segment.com/blog/a-brief-history-of-the-uuid/

Documentation / Reference





Discover More
Card Puncher Data Processing
Analytics - User Id (Person identity)

This page is a user identity (id). A person’s identity can be: anonymous (we don’t know anything this person yet), or named (we know something them such as an email address.) The most...
Data Modeling Chebotko Logical
Cassandra NoSql Database

Cassandra is a NoSql database for transactional workloads that require high scale and maximum availability. Cassandra is suited for transactional workloads at high volume and shouldn’t be considered...
Odds Of A Hash Collision
Collisions of Hash or Identifier Generation

A collision happens when a function produces the same result while it should be unique. In a hash function, it happens when two different inputs produces the same hash In a identifier generator, it...
Example Date Dimension
Dimensional Data Modeling - Date Dimension

The purpose of a date dimension is to support interpretation of the date. Different areas of the business will have different views of the calendar. A date dimension is critical in supporting such interpretations....
Data System Architecture
How to protect numerical Id ? (obfuscate, id encryption)

This page is how to obfuscate / protect a internal id. So that even if they are public, they cannot be guessed. This is mostly used to prevent guessing your data. For instance, you can estimate...
JSON - ObjectId

ObjectId are generated identifier (known as surrogate) with the intent to be unique for a Json. ObjectId are custom UUID that are created from: a counter timestamp (milliseconds) node id (IP...
Data System Architecture
Logical Data Modeling - Business Key

A business key is a key that use the business users to identify uniquely an entity. It's then also a unique key. The business key is a meaningful/business code based generally on the concatenation of...
Data System Architecture
Logical Data Modeling - Global Identifier (Fully Qualified Name, Canonical Form)

A global identifier identify uniquely a primary element (entity or relationship) in the global scope It's also generally known as: the fully qualified id the fully qualified name the canonical...
Data System Architecture
Logical Data Modeling - Id attribute

An id is an label attribute with a numeric representation that identify uniquely an entity or a relationship in a local scope. name If the id attribute is chosen as the local identifier, it will be:...
Data System Architecture
Logical Data Modeling - Key

A key refers to a attribute, or a group of attributes, which assumes a unique value for: each entity in a entity set. or relationship in a relationship set. keyidentifier... database primary...



Share this page:
Follow us:
Task Runner