Node ID System: Engineering Deep Dive

The "Reference Problem" is the core challenge of any mutable system. How do you point to an object that is constantly moving? This document explores reference stability, collision probability, and the performance cost of UUIDs in high-frequency text editing.

The Problem of Reference

In a static document, addressing is trivial: "The 5th paragraph." In a dynamic editor, especially with real-time collaboration, indices are meaningless.

If User A deletes Paragraph 1, then "Paragraph 5" instantly becomes "Paragraph 4" for User A, but remains "Paragraph 5" for User B until the network packet arrives. This Index Divergence is the root cause of 90% of collaboration bugs.

Mutable References (Paths)

  • Address: [0, 4, 1]
  • Pro: ZERO memory overhead.
  • Con: Invalidated by any precedent sibling mutation.
  • Use Case: Local rendering, simple scripts.

Immutable References (IDs)

  • Address: "b9f2-k8l1"
  • Pro: Mathematically stable forever.
  • Con: High memory cost (Reference storage).
  • Use Case: Collaboration, React Keys, Comments.

1. Path Addressing (Mutable)

Path addressing treats the document as a Coordinate System. A node is defined by its position in the tree hierarchy.

// A "Path" is a trail of array indices
type Path = number[]; 

// Example: Root -> 1st Child -> 4th Child
const target = [0, 3];

The Volatility Problem

Scenario: You are tracking a comment on Paragraph 10 path: [10].

Event: User deletes Paragraph 0.

All paths > [0] must be decremented.
Your comment now points to [10], which is actually the OLD Paragraph 11.

To make Paths work in a real editor, you need Operational Transformation (OT). You must mathematically "Transform" every single stored path against every incoming operation. This is computationally expensive (O(History_Length)) and prone to subtle bugs.

2. ID Addressing (Immutable)

ID addressing decouples Identity from Position. A node is defined by "Who it is", not "Where it is".

UUID Collision Mathematics

Developers often fear "What if two IDs collide?". Let's look at the math for UUIDv4 (122 random bits).

ID Type Bits Collision Risk (at 1B nodes)
Incrementing Int 32/64 100% (in distributed systems)
ShortID (8 chars) 48 Rare (0.01%)
UUIDv4 122 Negligible ($$10^-18$$)

Never use Auto-Increment Integers. If User A creates "Node 5" offlin and User B creates "Node 5" offline, you have an unresolvable ID conflict upon merge. Distributed systems require a standard random distribution.

Performance Analysis (V8)

While IDs are mathematically superior, they incur real machine costs.

String Interning & Memory

In V8 (Chrome/Node), strings are immutable. Heavy usage of string IDs can put pressure on the heap.

  • Object Shape Overhead: An object property key `node["id_123"]` is often "Interned" (cached globally). If you have 100,000 unique IDs, you have 100,000 entries in the V8 String Table.
  • Pointer Size: A sequential integer fits in a "Smi" (Small Integer, stored directly in the pointer). A string ID requires a pointer to a Heap Object (64-bit pointer + Object Header + String Data).
// MEMORY COST COMPARISON
// Integer ID: 0 bytes overhead (Stored in pointer)
const id = 1275; 

// String ID: ~40-60 bytes overhead
// (Pointer + Map Reference + Hash + Char Data)
const id = "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d";

*Mitigation:* Modern editors often use NanoID (URL-friendly, shorter) or Base64 encoding to reduce the byte storage size compared to raw hex UUIDs.

Implementation Patterns

1. The Reverse Lookup Map

Clicking a DOM node and finding the corresponding Model Node is a critical "Hot Path" (runs on every click/hover). Naive search is O(N). We need O(1).

class Editor {
  // Using a WeakMap prevents memory leaks!
  // If the DOM node is removed, this entry is GC'd.
  private domToModel = new WeakMap<HTMLElement, NodeID>();

  registerNode(domNode: HTMLElement, modelId: NodeID) {
    this.domToModel.set(domNode, modelId);
    // Also handy:
    domNode.dataset.id = modelId; 
  }

  getIdFromDom(domNode: HTMLElement): NodeID | null {
    // Allows O(1) lookup from Event.target
    return this.domToModel.get(domNode);
  }
}

2. Hybrid Addressing (Standard)

Most professional editors uses Dual Addressing.

  1. Storage uses IDs: The document model (JSON/CRDT) stores only IDs. This ensures sync safety.
  2. Selection uses Paths: The caret is defined as `(NodeID, Offset)` OR `(Path, Offset)`. Selection is ephemeral and local, so using Paths (which are faster to compute during typing) is acceptable, provided they are re-validated against IDs before saving.

3. ID Generation Algorithms

Not all IDs are created equal. The choice of algorithm has profound implications for storage size, database performance, and network bandwidth.

Compact NanoID

NanoID uses a larger alphabet (A-Z, a-z, 0-9, -, _) to pack more entropy into fewer characters. This is the recommended choice for Text Editors.

UUID (36 chars)
f47ac10b-58cc-4372-a567-0e02b2c3d479
NanoID (21 chars)
V1StGXR8_Z5jdHi6B-myT

40% Smaller Payload. For a document with 1000 nodes, this saves ~15KB of JSON compared to UUIDs. Over a websocket, this adds up.

Time-Sortable ULID & CUID

Universally Unique Lexicographically Sortable Identifier. These IDs start with a Timestamp and end with Randomness.

// ULID Structure
"01ARZ3NDEKTSV4RRFFQ69G5FAV"
// [ Timestamp  ] [ Randomness ]

// Benefit: Newer nodes always sort AFTER older nodes
const sorted = nodes.sort((a, b) => a.id.localeCompare(b.id));
// ^ This gives you creation order for FREE!

When to use: If you store Document versions or snapshots in a database, use ULIDs for the Snapshot ID. But for internal node IDs within the document, plain NanoID is sufficient since order is determined by the document structure.

4. Database Engineering Impact

If you persist documents to a database, your ID choice can make or break query performance.

B-Tree Index Fragmentation

Relational Databases (PostgreSQL, MySQL) use B-Tree indexes for Primary Keys. B-Trees are optimized for Sequential Insertion.

Sequential IDs (1, 2, 3...)
  • New rows append to the end of the index.
  • Pages stay full and compact.
  • Buffer cache hit ratio: High.
Random UUIDs
  • New rows insert at random positions.
  • Constant Page Splits.
  • Buffer cache hit ratio: Low.
  • Write Amplification: 2-5x.

Solution: Use ULID or CUID for database Primary Keys (they are time-sortable), or use a separate auto-increment internal_id for the PK and store the random ID as a secondary indexed column.

5. Frontend Architecture

On the frontend, stable IDs are essential for performance and UX.

React Reconciliation & Key Prop

React uses the key prop to identify which virtual DOM nodes correspond to which real DOM nodes.

Bad: key={index}

If you insert a paragraph at the top, every key shifts. React destroys and re-creates all components. Focus is lost. IME state is reset.

Good: key={node.id}

The node's key remains stable. React sees the new ordering and moves the existing DOM nodes. Focus and IME are preserved.

Virtualization & Scroll Anchoring

For documents with 1000+ nodes, you must use Virtualization (only rendering visible nodes). Libraries like react-window or tanstack-virtual rely on stable IDs.

// Virtualization with stable IDs
const sizeCache = new Map<NodeID, number>();

function onNodeResize(id: NodeID, height: number) {
  sizeCache.set(id, height);
}

// When the user scrolls back up, we know the EXACT height
// of off-screen nodes. No scroll jumping!
function getItemSize(id: NodeID): number {
  return sizeCache.get(id) ?? ESTIMATED_HEIGHT;
}

Without stable IDs, if content above the viewport changes, the scroll position jumps erratically because heights cannot be cached.

6. Security Implications

IDs are often exposed in URLs. Your choice of ID format has security implications.

URL Pattern Security Risk
/doc/1055 Enumeration Attack. Attacker iterates through /doc/1056, /doc/1057... and scrapes your entire database.
/doc/b9f2... Secure. The search space is too large to brute-force. Cannot guess adjacent document IDs.
ULID timestamps Information Leakage. ULIDs reveal creation time. May be acceptable or not depending on your use case.

Best Practice: Use random IDs (NanoID/UUID) for public-facing URLs. If you need time-sorting internally, use ULID for database PKs but expose a random public_id in the API.

7. Real-world Editor Analysis

How do popular open-source editors handle node identity?

ProseMirror

Uses Path-based addressing with Mapping for OT. Nodes do not have inherent IDs. This makes it simpler but requires careful transform logic for collaboration plugins.

Slate.js

Uses Path-based addressing. Nodes can optionally have IDs, but it's up to the developer to add them. Most real-world Slate apps add id to every Element.

Lexical (Meta)

Uses ID-based addressing with internal __key property on every node. This is used for React reconciliation and node lookup. The key is a simple incrementing integer per editor instance (safe because it's local, not synced).

Notion / BlockSuite

Uses CRDT-based ID addressing. Every block has a stable ID that is used for real-time sync. The underlying CRDT (Yjs) manages conflict resolution using Lamport Timestamps embedded in the ID.