Logical Data Modeling - (Surrogate | Substitute | Synthetic | Generated ) Primary key

About

A surrogate key is a substitute primary key for when:

  • the data entity are created in distributed way
  • you don't have access to a central entity such as database to create a simple sequence
  • you don't have a natural one (entity not found for instance)
  • the key is too long for the database (example: url for a web page)

Since the value has no meaning, there is unlikely to be any demand for a change making them true primary key.

Surrogate means to choose in place of another

It's a primary key that holds a generated value.

A surrogate key exists almost always next to a business key.

Example

  • An ID column populated via a database sequence is known as a surrogate key. - It works only when data is being inserted into only one database (not many databases at the same time)
  • A ObjectId - A random hash-based generated value for Json object
    • of a long natural key (such as an URL)
    • of all values of the record
SELECT * FROM tbl_name
  WHERE surrogate_pk=MD5(urlValue)
  AND url=urlValue;

Usage

No key

If there is no logical unique and non-null column or set of columns

Space

Numeric values also produce shorter values of predictable length. Storing numeric codes in place of string is more space-efficient.

Dimensional Modeling

The use of SKs in dimensional modeling is not only simplicity of star design or performance.

You often need to populate dimensions with extra records not found in the source table, such as:

  • or “NOT APPLIES” records.

If you don't use SKs, What PK value would you use for this extra records?

How can you be sure that the values you choose won't collide with future records from the source table?

Also, taking your dimensional schemas away from possible source key changes is a good idea. Of course creating and maintaining SKs adds complexity and is extra work for ETL processes.

Characteristic

Sortable by Time

If the id are sortable by time. a list entity could be sorted without fetching extra information.

Storage

IDs should ideally be short enough.

Database system are generally limited at 64 bits to store integer

Collision

collision probability should be acceptable

Implement

Surrogate key may be implemented via:

Database Sequence

A sequence (auto-incrementing number) in a database is suitable in a centralized architecture. If the id can be generated decentralized, you need to add more information to make them unique (ie the time and node name for instance as instagram does)

Server-side generated identifiers are pretty much guaranteed to be unique.

GUID

Via the standard guid outside a database (distributed)

Bootstrap

Bootstrap implements it as a completly random number 1)

const MAX_UID = 1000000
const getUID = prefix => {
  do {
    prefix += Math.floor(Math.random() * MAX_UID)
  } while (document.getElementById(prefix))

  return prefix
}

Generated / Object Id

Generated outside a database:

In datawarehouse, they may implements a not/so real surrogate key as a hash of the value of all columns.

Fingerprint

device fingerprinting is also a generated way to identify an individual.

NanoId

Nanoid is a random ID string comparable to UUID v4 (random-based)

  • URL friendly
  • 21 characters (vs 36 with UUID because of a bigger alphabet).
  • Polyglot: 14 languages

If the number of id generated is small, you can reduce the size: Collision probability by size

const { nanoid } = require('nanoid');
nanoid(15); //=> "YrxC0J7fBQobaL-"

Id Protection / Obfuscation

HashId

  • Number to text with salt: https://hashids.org/ - YouTube-like ids from numbers - Use it when you don't want to expose your database ids to the user.

Example:

let minimalHashLength = 8;
let hashAlphabet = "abcdefghijklmnopqrstuvwxyz1234567890";
let hashids = new Hashids("this is my salt", minimalHashLength, hashAlphabet)
let encode1 = hashids.encode([1]);
let decode1 = hashids.decode(encode1);
console.log(`The value 1 encoded is: ${encode1}`);
console.log(`The encoded value ${encode1} decoded is: ${decode1}`);
  • Hashid will return an empty array if it can't decode
let badvalue = "badvalue";
let decodedBadValue = hashids.decode(badvalue);
console.log(`The bad value (${badvalue}) decoded is an empty array: ${decodedBadValue}`);
  • You can encode multiple value
let multipleValues = [1,2];
let multipleValuesEncoded = hashids.encode(multipleValues);
console.log(`Mutliple values (${multipleValues}) are encoded to ${multipleValuesEncoded}`);
let multipleValuesDecoded = hashids.decode(multipleValuesEncoded);
console.log(`Mutliple values encoding (${multipleValuesEncoded}) are decoded to ${multipleValuesDecoded}`);
  • Because you can encrypt by part, you can also encode an Uuid
let uuidString = uuid.v4();
let uuidByteArray = uuid.parse(uuidString); // 16 bytes array
let dataViewByteArray = new DataView(uuidByteArray.buffer, 0);
let uuidTimeLow = dataViewByteArray.getUint32(0);
let uuidTimeMid = dataViewByteArray.getUint16(4);
let uuidTimeHighAndVersion = dataViewByteArray.getUint16(6);
let uuidClokSeq = dataViewByteArray.getUint16(8);
let uuidNode1 = dataViewByteArray.getUint32(10);
let uuidNode2 = dataViewByteArray.getUint16(14);
let uuidHashed = hashids.encode([uuidTimeLow,uuidTimeMid,uuidTimeHighAndVersion,uuidClokSeq,uuidNode1, uuidNode2]);
console.log(`The uuid ${uuidString} has been encoded to ${uuidHashed} of length ${uuidHashed.length}`);

Optimus

Optimus:

  • obfuscates id number into integer id (no text)
  • is based on Knuth's integer hash method

Implementation:

Documentation / Reference


Powered by ComboStrap