
How Google Docs Real-Time Collaboration Works — OT vs CRDTs, Architecture, and the Engineering Nobody Explains

By Akshay Ghalme·Updated April 16, 2026·22 min read

You and a coworker are editing the same Google Doc. You type "hello" at position 10. At the exact same moment, they delete the character at position 5. Neither of you knows about the other's change yet. Somehow, within a few hundred milliseconds, both of your screens converge on the same document with both changes applied correctly — not "both changes in some order," but "both changes in the only order that makes sense given what each of you intended." Now scale that to 100 people editing one doc, and 1 billion docs in existence. That is what Operational Transform solves, and the algorithm behind it is one of the most elegant pieces of distributed systems design ever deployed to regular users.

This post walks through how Google Docs actually makes collaborative editing work, grounded in the 1995 Jupiter paper from Xerox PARC ("High-Latency, Low-Bandwidth Windowing in the Jupiter Collaboration System"), which introduced the specific OT variant Google Docs uses, plus public talks from Google engineers and the decades of academic literature on collaborative editing. The algorithm is not secret. What is rare is an explanation that treats it as a real systems problem rather than a theory exercise.

The Naive Approach, and Why It Does Not Work

The first thing anyone designing a collaborative editor thinks is: "Just send the whole document every time someone changes anything." This breaks the moment you think about it:

  • A 100-page document being sent on every keystroke is absurd bandwidth
  • Two users sending overlapping document states creates an obvious race — whose save wins?
  • Latency means each user sees a stale version, and rebasing a whole document is a diff problem every round trip

The second idea is: "Send just the changes." Better — you now send a small patch like insert "hello" at position 10. But this has a subtle and devastating flaw.

Imagine the starting document is Hello, world!. Two users see it at the same time:

  • Alice deletes character at position 0 ("H"). She expects the result ello, world!.
  • Bob, simultaneously, inserts "!" at position 5 (between "o" and ","). He expects Hello!, world!.

Both users send their operations to the server. Alice's op arrives first, and the server applies it: ello, world!. Now Bob's op arrives: insert "!" at position 5. But position 5 in the new document is a different character than position 5 in the document Bob was looking at when he made the change. If the server naively applies Bob's op, the "!" lands after the comma instead of before it, producing ello,! world!. If the server then broadcasts that result, Alice sees punctuation in a place nobody intended, and Bob sees his "!" somewhere he did not type it.

This is the core problem of concurrent editing: operations are defined relative to a document state, and that state drifts as other operations arrive. You cannot just send the ops and apply them in arrival order. Every operation needs to be adjusted — "transformed" — to account for the operations that happened concurrently with it.
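Applying both ops verbatim, each replica in its own arrival order, shows the problem in a few lines of Python. The `insert`/`delete` helpers are illustrative, not part of any real editor:

```python
def insert(doc: str, pos: int, text: str) -> str:
    return doc[:pos] + text + doc[pos:]


def delete(doc: str, pos: int) -> str:
    return doc[:pos] + doc[pos + 1:]


start = "Hello, world!"
# Alice's replica (and the server): her delete first, then Bob's insert, untransformed.
alice_view = insert(delete(start, 0), 5, "!")
# Bob's replica: his insert first, then Alice's delete, untransformed.
bob_view = delete(insert(start, 5, "!"), 0)
print(alice_view)  # ello,! world!
print(bob_view)    # ello!, world!
```

The same two operations, applied in different orders without adjustment, leave the two replicas permanently disagreeing.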

Operational Transform — The Trick That Solves It

Operational Transform (OT) is a family of algorithms that defines, for every pair of operation types, a transformation function. The transformation function takes two concurrent operations and produces two new operations that can be applied in either order to produce the same final state. Formally, if op1 and op2 are concurrent, then:

apply(apply(state, op1), transform(op2, op1))
  ==
apply(apply(state, op2), transform(op1, op2))

Read that carefully. It says: if you apply op1 first and then a transformed version of op2, you get the same state as if you applied op2 first and then a transformed version of op1. This property is called TP1 (Transformation Property 1), and it is the mathematical foundation that makes OT work.

Back to our example. Alice's delete at position 0 and Bob's insert at position 5 are concurrent. When Bob's operation arrives at the server after Alice's has been applied, the server transforms it: one character was deleted before position 5, so everything after it shifted left by one, and old position 5 is now position 4. Bob's op becomes insert "!" at position 4, and when applied to ello, world!, it produces ello!, world!: exactly what Bob intended, at the correct character.
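The transform rules for the two simplest op types can be sketched in Python. This is a character-level illustration written for this post, not Google's code: the `Insert`/`Delete` types, the `site` tie-breaker, and the function names are assumptions. It replays the Alice/Bob example and then brute-forces TP1 over every insert/delete pair on a short string.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: int  # hypothetical client id, used only to break ties between inserts


@dataclass(frozen=True)
class Delete:
    pos: int  # removes exactly one character


Op = Union[Insert, Delete]


def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # a delete cancelled by a concurrent delete of the same char
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(a: Op, b: Op) -> Optional[Op]:
    """Rewrite `a` so it can be applied after the concurrent op `b`."""
    if isinstance(a, Insert):
        if isinstance(b, Insert):
            if b.pos < a.pos or (b.pos == a.pos and b.site < a.site):
                return replace(a, pos=a.pos + len(b.text))
            return a
        return replace(a, pos=a.pos - 1) if b.pos < a.pos else a  # b is a Delete
    if isinstance(b, Insert):  # a is a Delete
        return replace(a, pos=a.pos + len(b.text)) if b.pos <= a.pos else a
    if b.pos < a.pos:
        return replace(a, pos=a.pos - 1)
    return None if b.pos == a.pos else a  # same char deleted twice -> no-op


# The example from the text: Alice deletes "H", Bob inserts "!" at position 5.
doc = "Hello, world!"
alice, bob = Delete(0), Insert(5, "!", site=2)
server_view = apply(apply(doc, alice), transform(bob, alice))  # Alice's op arrived first
bob_view = apply(apply(doc, bob), transform(alice, bob))       # Bob applied his own op first
print(server_view)  # ello!, world!
print(bob_view)     # ello!, world!


def tp1_holds(doc: str) -> bool:
    """Check TP1 for every insert/delete pair from two different sites on `doc`."""
    def ops(site: int, ch: str) -> list:
        return [Insert(p, ch, site) for p in range(len(doc) + 1)] + [Delete(p) for p in range(len(doc))]

    return all(
        apply(apply(doc, a), transform(b, a)) == apply(apply(doc, b), transform(a, b))
        for a in ops(1, "x")
        for b in ops(2, "y")
    )


print(tp1_holds("Hello"))  # True
```

The tie-breaker is not decoration: without a deterministic rule for two inserts at the same position, each side would put its own text first and the replicas would diverge.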

Notice what just happened. The server did not ask Bob to redo anything. It did not reject his operation. It transformed his operation to match the reality that arrived between when he typed it and when it reached the server. Alice sees the correctly-transformed result. So does Bob. Everyone converges. Nothing is lost.

The Jupiter Model — Google's Server-Centric Variant

There are many OT variants, differing in how they handle the hardest cases (nested concurrent operations, multi-user convergence, operations on rich text with formatting). Google Docs uses a variant called the Jupiter model, published in 1995 by David Nichols, Pavel Curtis, Michael Dixon, and John Lamping of Xerox PARC.

The Jupiter model is intentionally simpler than fully-general OT because it makes one strong assumption: there is a central server that defines the canonical ordering of operations. Every client only has to reason about two operation streams: its own local operations and the stream of operations coming from the server. The server, in turn, only has to reason about each client's operations relative to the server's history. This drastically reduces the number of cases the transformation function has to handle, compared to a peer-to-peer OT system where every pair of clients must pairwise resolve conflicts.

This server-centric choice is the architectural decision that makes Google Docs tractable. A fully decentralized collaborative editor is theoretically possible, but the complexity balloons with the number of concurrent editors. Making one node authoritative — the server — linearizes the world into a single sequence that everyone else rebases onto.

The Full Client-Server Loop

Here is what actually happens when you type a character into Google Docs.

  1. Local application. The client immediately applies your keystroke to your local copy of the document. Your screen updates instantly. This is called optimistic application and it is what makes the editor feel responsive — there is no round trip before you see your own typing.
  2. Operation encoding. The client creates an operation — something like {type: "insert", position: 247, text: "a", author: "alice", base_revision: 1042}. The base_revision field is critical: it is the document revision number the client was looking at when it made the operation.
  3. Send to server. The operation is pushed to the server over the persistent connection (historically a long-poll, now typically WebSocket or similar) the client maintains for the doc.
  4. Server rebases. The server receives the operation with base_revision: 1042 but the server is now on revision 1048 — six operations from other users have arrived since the client's last sync. The server transforms the incoming operation against each of those six, one at a time, producing a version of the operation that makes sense against revision 1048.
  5. Server commits and assigns a new revision. The transformed operation is applied to the server's canonical document, producing revision 1049. The server's history now has one more operation in it.
  6. Server broadcasts. The transformed operation, along with its new revision number, is broadcast to every other client viewing the document.
  7. Clients rebase locally. Every receiving client checks what revision it thought it was on. If it matches the new operation's base, great — the operation applies cleanly. If not (because the client had its own pending ops in flight), the client transforms the incoming operation against its pending ops, applies the result to its displayed document, and updates its own pending ops so they match the new server state.
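Steps 4 through 6 on the server side can be sketched as a per-document object that keeps an ordered history and rebases each incoming op against everything committed after the op's base revision. This hypothetical Python sketch handles inserts only to stay short (the `DocServer` name and the author-based tie-break are assumptions); a real server needs a correct transform for every op pair.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    author: str


def transform(a: Insert, b: Insert) -> Insert:
    """Shift `a` so it applies after the concurrent insert `b` (ties broken by author)."""
    if b.pos < a.pos or (b.pos == a.pos and b.author < a.author):
        return replace(a, pos=a.pos + len(b.text))
    return a


class DocServer:
    """Hypothetical single-document server: canonical text plus ordered op history."""

    def __init__(self, text: str = "") -> None:
        self.text = text
        self.history: list[Insert] = []  # history[i] produced revision i + 1

    @property
    def revision(self) -> int:
        return len(self.history)

    def receive(self, op: Insert, base_revision: int) -> tuple[Insert, int]:
        # Step 4: rebase against every op committed after the client's base revision.
        for committed in self.history[base_revision:]:
            op = transform(op, committed)
        # Step 5: apply to the canonical text and assign the next revision.
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.history.append(op)
        # Step 6: the caller would broadcast (op, revision) to the other clients.
        return op, self.revision


server = DocServer("abc")
server.receive(Insert(1, "X", "alice"), base_revision=0)          # commits revision 1
op, rev = server.receive(Insert(3, "Y", "bob"), base_revision=0)  # stale: based on "abc"
print(op.pos, rev, server.text)  # 4 2 aXbcY
```

Bob's op was written against revision 0, so the server shifts it one place right to account for Alice's "X" before committing it as revision 2.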

This loop runs on every keystroke, every cursor move, every bold-on-off toggle. At any given moment, every client has three versions of the document in memory: the last version it knows the server has confirmed, the version with all the operations it has sent but not yet heard back about, and the version the user is currently seeing. All three are continuously reconciled as the network delivers updates.
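The three versions map naturally onto client state. One common way to structure it (an assumption here, not confirmed for Google Docs) keeps at most one op in flight and buffers the rest; every remote op is transformed past the local ops, and each local op past the remote one. Hypothetical Python sketch, inserts only:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    author: str


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(a: Insert, b: Insert) -> Insert:
    """Shift `a` so it applies after the concurrent insert `b` (ties broken by author)."""
    if b.pos < a.pos or (b.pos == a.pos and b.author < a.author):
        return replace(a, pos=a.pos + len(b.text))
    return a


class Client:
    def __init__(self, text: str, author: str) -> None:
        self.author = author
        self.revision = 0
        self.confirmed = text                    # 1) last state the server acknowledged
        self.in_flight: Optional[Insert] = None  # 2) sent, waiting for the server's ack
        self.buffer: list[Insert] = []           #    typed locally, not sent yet
        self.visible = text                      # 3) confirmed + in_flight + buffer

    def local_insert(self, pos: int, text: str) -> None:
        op = Insert(pos, text, self.author)
        self.visible = apply(self.visible, op)   # optimistic: no round trip
        self.buffer.append(op)

    def next_to_send(self) -> Optional[tuple[Insert, int]]:
        """Send at most one op at a time, tagged with its base revision."""
        if self.in_flight is None and self.buffer:
            self.in_flight = self.buffer.pop(0)
            return self.in_flight, self.revision
        return None

    def on_ack(self) -> None:
        # in_flight was already rebased past every earlier remote op,
        # so it matches what the server committed.
        self.confirmed = apply(self.confirmed, self.in_flight)
        self.in_flight = None
        self.revision += 1

    def on_remote(self, op: Insert) -> None:
        self.confirmed = apply(self.confirmed, op)
        self.revision += 1
        local = ([self.in_flight] if self.in_flight is not None else []) + self.buffer
        rebased = []
        for mine in local:                       # transform both ways, op by op
            rebased.append(transform(mine, op))
            op = transform(op, mine)
        if self.in_flight is not None:
            self.in_flight, self.buffer = rebased[0], rebased[1:]
        else:
            self.buffer = rebased
        self.visible = apply(self.visible, op)


bob = Client("abc", "bob")
bob.local_insert(3, "Y")                 # Bob sees "abcY" immediately
print(bob.next_to_send())                # (Insert(pos=3, text='Y', author='bob'), 0)
bob.on_remote(Insert(1, "X", "alice"))   # server committed Alice's op first
bob.on_ack()                             # then acknowledged Bob's (rebased) op
print(bob.visible, bob.confirmed, bob.revision)  # aXbcY aXbcY 2
```

Keeping a single op in flight means the client never has to guess which base revision a second unacknowledged op should carry.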

Here is the loop visualized:

  1. Alice's browser: types "hello" at pos 10 (applied locally, instantly)
  2. Alice → Server: insert("hello", pos:10, rev:1042)
  3. Bob's browser: deletes the char at pos 5 (applied locally, instantly)
  4. Bob → Server: delete(pos:5, rev:1042)
  5. Server: receives Alice's op first and applies it to the canonical doc → rev 1043
  6. Server: receives Bob's op (rev:1042) and transforms it against Alice's op: delete(pos:5) → delete(pos:5), unchanged because pos 5 is before Alice's insert at pos 10
  7. Server: applies the transformed op → rev 1044
  8. Server → Bob: Alice's op, transformed for Bob's state (a character before pos 10 is gone, so it becomes insert("hello", pos:9))
  9. Server → Alice: Bob's op, transformed for Alice's state: delete(pos:5)
  10. Alice's browser applies the transformed Bob op; Bob's browser applies the transformed Alice op
  11. Both screens converge on an identical document

The critical insight: neither Alice nor Bob ever waits for the server before seeing their own typing. The local application happens instantly (optimistic update). The server reconciliation happens in the background. The user never perceives latency unless the network itself is slow — and even then, their own keystrokes are immediate. This is why Google Docs feels fast even on poor connections.

Why This Is Not a Trivial Piece of Code

Describing OT in three paragraphs makes it sound easy. Writing an OT implementation that survives adversarial real-world inputs is famously hard — hard enough that there are academic papers specifically about bugs in published OT algorithms. Some of the things that make production OT difficult:

Every operation pair needs a correct transform function

Insert vs insert, insert vs delete, delete vs delete, format vs insert, format vs format, list-bullet vs paragraph break, image embed vs text delete: the number of cases scales roughly quadratically with the number of operation types. Each case needs a correct transform that preserves TP1, and systems without a central server also need TP2, a stronger property that keeps replicas convergent when the same operations are transformed along different paths. Get any one case wrong and the document eventually diverges between clients.

Rich text operations are not atomic

When you paste a 500-word block of formatted text, that is not one operation; it is dozens. Inserts, format ranges, style applications, list conversions. Each has to be transformable against any concurrent operation on the same text region. This is where OT implementations most often break: the combinatorial space of "what if this weird paste happens at the same time as that weird format change" is enormous.

Ordering without global consensus

The server assigns sequence numbers, but messages can still arrive late, out of order, or more than once, especially across reconnects and retries. Clients have to be able to accept operations in any order, buffer the ones they are not yet ready for, and apply them as soon as their dependencies arrive. This is a distributed systems problem on top of the OT problem.
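One way to handle this on the client is a small reorder buffer keyed by the server's revision number: early arrivals wait, duplicates are dropped, and ops are released strictly in sequence. A hypothetical Python sketch (ops are plain strings here; `InOrderDelivery` is an illustrative name):

```python
class InOrderDelivery:
    """Release server ops strictly by revision, holding early arrivals."""

    def __init__(self, last_applied: int) -> None:
        self.last_applied = last_applied
        self.waiting: dict[int, str] = {}

    def receive(self, revision: int, op: str) -> list[str]:
        if revision <= self.last_applied:
            return []                    # duplicate, e.g. redelivered after a retry
        self.waiting[revision] = op
        ready = []
        while self.last_applied + 1 in self.waiting:
            self.last_applied += 1
            ready.append(self.waiting.pop(self.last_applied))
        return ready


inbox = InOrderDelivery(last_applied=1042)
print(inbox.receive(1044, "op-b"))  # [] (1043 has not arrived yet)
print(inbox.receive(1043, "op-a"))  # ['op-a', 'op-b']
print(inbox.receive(1043, "op-a"))  # [] (already applied)
```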

Undo and redo across users

Your "undo" should undo your last action, not the action someone else made after you typed yours. Implementing a user-local undo stack in an OT system that has rebased your operations multiple times is surprisingly subtle — your "last op" may have been transformed several times and no longer means what you thought it meant.
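A minimal sketch of why this is subtle, using hypothetical `Insert`/`Delete` types and helper names: undoing your own insert means deleting exactly the characters you typed, but other users' later inserts have moved them, so the inverse op must be transformed past everything applied after it. The sketch refuses the hardest case (someone typing inside your range), which a real undo would handle by splitting the delete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int
    length: int


def shift_past(d: Delete, later: Insert) -> Delete:
    """Transform a range delete so it still targets the same characters after `later`."""
    if later.pos <= d.pos:
        return Delete(d.pos + len(later.text), d.length)
    if later.pos < d.pos + d.length:
        # Someone typed inside the range: a real undo must split the delete in two.
        raise ValueError("insert lands inside the range being undone")
    return d


def undo(my_op: Insert, later_ops: list[Insert]) -> Delete:
    inverse = Delete(my_op.pos, len(my_op.text))   # inverse of my insert
    for op in later_ops:                           # everything applied after it
        inverse = shift_past(inverse, op)
    return inverse


def apply(doc: str, op) -> str:
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


mine = Insert(0, "abc")                            # Alice types "abc"
later = [Insert(0, "X"), Insert(4, "!")]           # then Bob types around it
doc = apply(apply(apply("", mine), later[0]), later[1])
print(doc)                                         # Xabc!
print(apply(doc, undo(mine, later)))               # X!
```

Alice's undo removes her "abc" and leaves Bob's "X" and "!" alone, even though her original op said "position 0".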

Presence — Cursors, Selections, and the Easier Channel

Document operations are only one kind of real-time state in Google Docs. The other is presence: who else is in the doc, where their cursor is, what they have selected. Presence is much easier to implement because it does not need convergence guarantees — nobody cares if a cursor position is stale by 100ms.

Presence runs on the same persistent connection as operations but through a separate pub/sub channel. When you move your cursor, the client sends a lightweight update — "my cursor is now at position 423" — which the server fans out to everyone else in the doc. No OT, no transformation, no history. If two updates arrive out of order, the newest one wins because presence is ephemeral.
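"Newest wins" needs a way to tell which update is newest. A minimal sketch, assuming each client stamps its cursor updates with its own increasing counter (the `seq` field and the class names are hypothetical, not a documented Google Docs format):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CursorUpdate:
    user: str
    seq: int       # per-user counter, incremented on every cursor move
    position: int


class PresenceTable:
    """Ephemeral presence state: no history and no transforms; newest update per user wins."""

    def __init__(self) -> None:
        self.cursors: dict[str, CursorUpdate] = {}

    def receive(self, update: CursorUpdate) -> bool:
        current = self.cursors.get(update.user)
        if current is not None and current.seq >= update.seq:
            return False  # stale or duplicate: drop it
        self.cursors[update.user] = update
        return True


presence = PresenceTable()
presence.receive(CursorUpdate("alice", seq=2, position=423))
presence.receive(CursorUpdate("alice", seq=1, position=400))  # arrived late: ignored
print(presence.cursors["alice"].position)  # 423
```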

This separation of concerns is a general pattern worth knowing: not all real-time state needs the same consistency guarantees. Split your channels by the strictness of their convergence requirements. Document content needs OT. Cursors need pub/sub. Chat messages may need yet another model (append-only log). Treating them the same is how you end up with a complicated system that is over-engineered for the easy cases and under-engineered for the hard ones.

CRDTs — The Alternative Google Did Not Choose (and Why)

In the last decade, a different approach to collaborative editing has become popular: Conflict-free Replicated Data Types, or CRDTs. A CRDT is a data structure designed so that concurrent updates can be merged without needing a central server or a transformation function. Every update carries enough metadata (typically vector clocks or unique ID tags) that any two replicas can compute the same merged state independently.
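To make "unique ID tags" concrete, here is a toy sequence CRDT in the style of RGA (Replicated Growable Array), one of several published designs; Yjs and Automerge use their own, more optimized algorithms. The sketch assumes causal delivery (an op arrives only after the ops it references) and no duplicate delivery. Every character gets a (Lamport counter, replica) ID, inserts reference the ID they follow, deletes leave tombstones, and inserts at the same spot are ordered by ID, so replicas converge whichever order they merge in.

```python
from dataclasses import dataclass

Id = tuple[int, str]  # (Lamport counter, replica name): unique and totally ordered


@dataclass
class Node:
    id: Id
    char: str
    deleted: bool = False  # tombstone: deleted characters are hidden, never removed


class RGA:
    def __init__(self, replica: str) -> None:
        self.replica = replica
        self.clock = 0
        self.nodes: list[Node] = []

    def text(self) -> str:
        return "".join(n.char for n in self.nodes if not n.deleted)

    def _visible(self) -> list[Node]:
        return [n for n in self.nodes if not n.deleted]

    def local_insert(self, index: int, char: str) -> tuple:
        self.clock += 1
        after = self._visible()[index - 1].id if index > 0 else None
        op = ("ins", (self.clock, self.replica), after, char)
        self.apply(op)
        return op

    def local_delete(self, index: int) -> tuple:
        op = ("del", self._visible()[index].id)
        self.apply(op)
        return op

    def apply(self, op: tuple) -> None:
        if op[0] == "del":
            for n in self.nodes:
                if n.id == op[1]:
                    n.deleted = True
            return
        _, new_id, after, char = op
        self.clock = max(self.clock, new_id[0])
        pos = 0 if after is None else 1 + next(i for i, n in enumerate(self.nodes) if n.id == after)
        # Nodes after the anchor with a larger ID keep their spot (along with
        # everything typed after them), so skip past them.
        while pos < len(self.nodes) and self.nodes[pos].id > new_id:
            pos += 1
        self.nodes.insert(pos, Node(new_id, char))


alice, bob = RGA("alice"), RGA("bob")
for op in [alice.local_insert(0, "H"), alice.local_insert(1, "i")]:
    bob.apply(op)                              # both replicas now read "Hi"
a_op = alice.local_insert(2, "!")              # concurrent edits at the same spot...
b_op = bob.local_insert(2, "?")
alice.apply(b_op)                              # ...merged in opposite orders
bob.apply(a_op)
print(alice.text(), bob.text())                # Hi?! Hi?!
```

No transform function and no server: each replica integrates the other's op by ID alone. The price is visible too, since every character carries an ID and deleted characters stay in `nodes` as tombstones.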

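CRDTs are elegant. They work peer-to-peer. They do not need a central authority to linearize operations. Automerge and Yjs are CRDT libraries, Linear uses CRDTs, and Figma's multiplayer engine is CRDT-inspired (Figma has described it as not a true CRDT, because its server stays authoritative). Why did Google not use them?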

The honest answer is history. Google Docs dates back to 2006, and its real-time collaboration engine was built around 2010, before modern CRDT theory was mature. By the time CRDTs became production-ready, Google already had an enormous investment in Jupiter-model OT and a large body of working code. The cost of migrating to CRDTs (retraining the team, rewriting the algorithms, re-verifying correctness against real-world usage) was far higher than the marginal benefit. Operational Transform works, it has worked for over 15 years, and there is no user-visible problem that demands a switch.

For new systems today, CRDTs are often the right choice, especially if you want offline-first or peer-to-peer collaboration. But OT is not wrong — it is a different trade-off, and at Google's scale with a server-centric architecture, Jupiter-model OT is a legitimate production choice with decades of battle-testing.

OT vs CRDT — Complete Comparison

Since this is the most common question engineers ask when designing a collaborative system, here is the full comparison:

| Dimension | Operational Transform (OT) | CRDTs |
| --- | --- | --- |
| Central server required? | Yes (in the Jupiter model) | No; works peer-to-peer |
| Offline support | Limited: needs reconnect + rebase | Excellent: merge on reconnect is automatic |
| Implementation complexity | High: transform functions for every op pair | Moderate, but metadata overhead is high |
| Memory overhead | Low: only current state + pending ops | Higher: tombstones, vector clocks, unique IDs per character |
| Correctness verification | Hard: TP1/TP2 properties for every op combination | Easier: mathematical commutativity guarantees |
| Rich-text support | Mature: decades of production use | Improving: Yjs and Peritext are getting close |
| Used by | Google Docs, Google Sheets, Etherpad | Figma (CRDT-inspired), Linear, Notion (partially), Automerge, Yjs |
| Best for | Server-centric architectures with strong consistency needs | Offline-first, local-first, peer-to-peer apps |
| Scalability pattern | Scale the server (sharding by document) | Scale is inherent (no server bottleneck) |

The one-sentence decision: if you have a server and your users are mostly online, OT is simpler to reason about. If you need offline-first or peer-to-peer, CRDTs are the practical option that works without a server in the critical path.

How Google Docs Scales to 1 Billion Documents

The OT algorithm solves the correctness problem. Scaling it to every Google account on earth is a separate problem entirely. Here is what that looks like architecturally:

Document Sharding

Every Google Doc has a unique ID. The server fleet is partitioned by document ID — each document's operation stream is owned by exactly one server process at any time. This is the same pattern as Kafka partition leadership or DynamoDB partition keys: shard by the natural entity boundary so that all operations for one document are linearized by one process.
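A hypothetical sketch of document-ID sharding in Python: a stable hash picks the owning shard, and that shard is the only place revision numbers for the document are assigned. Plain modulo hashing and the `Shard`/`route` names are assumptions for brevity; a real fleet would more likely use consistent hashing or a lookup service with leases, so documents can move between servers without remapping everything.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # hypothetical fleet size


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable mapping: the same document ID always lands on the same shard."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class Shard:
    """One process that linearizes every op for the documents it owns."""

    def __init__(self) -> None:
        self.revisions = defaultdict(int)  # doc_id -> latest revision

    def commit(self, doc_id: str) -> int:
        self.revisions[doc_id] += 1  # single writer per doc, so no cross-process race
        return self.revisions[doc_id]


shards = [Shard() for _ in range(NUM_SHARDS)]


def route(doc_id: str) -> Shard:
    return shards[shard_for(doc_id)]


print(route("doc-abc").commit("doc-abc"), route("doc-abc").commit("doc-abc"))  # 1 2
```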

Operation Log as the Source of Truth

The server does not store the "current document." It stores the operation log — the complete ordered sequence of every operation ever applied. The current document is a materialized view computed by replaying the log. This is event sourcing applied to collaborative editing. It gives you full audit history, unlimited undo depth, revision history ("see all versions"), and the ability to rebuild the document from scratch if the materialized view ever becomes corrupted.
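In event-sourcing terms, the current document and every historical version are replays of a prefix of the log. A minimal Python sketch (the `Op` shape and `materialize` name are assumptions, not Google's storage format):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str        # "insert" or "delete"
    pos: int
    text: str = ""   # characters to insert (insert only)
    length: int = 0  # characters to remove (delete only)


def apply(doc: str, op: Op) -> str:
    if op.kind == "insert":
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


def materialize(log: list[Op], revision: Optional[int] = None) -> str:
    """Replay the log from an empty document; revision=None means the latest."""
    doc = ""
    for op in log[:revision]:
        doc = apply(doc, op)
    return doc


log = [
    Op("insert", 0, text="Hello world"),
    Op("insert", 5, text=","),
    Op("delete", 0, length=1),
    Op("insert", 0, text="J"),
]
print(materialize(log))     # Jello, world
print(materialize(log, 2))  # Hello, world   (the "see all versions" view)
```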

Snapshot + Log Compaction

For a document with 10,000 edits, replaying the full log on every load would be slow. Google Docs periodically takes a snapshot — a frozen copy of the document at a specific revision number — and serves it alongside only the operations after that snapshot. When you open a doc, the server sends the last snapshot plus the tail of the operation log. This is the same pattern as database WAL (write-ahead log) + checkpoint.
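The snapshot-plus-tail load path fits in a few lines of Python; the `Splice` op shape and the function names are illustrative assumptions. The assertion at the end confirms that loading from a snapshot yields the same document as replaying the whole log.

```python
from dataclasses import dataclass

Splice = tuple[int, int, str]  # (position, chars to delete, text to insert)


def apply(doc: str, op: Splice) -> str:
    pos, n, text = op
    return doc[:pos] + text + doc[pos + n:]


def replay(doc: str, ops: list[Splice]) -> str:
    for op in ops:
        doc = apply(doc, op)
    return doc


@dataclass(frozen=True)
class Snapshot:
    revision: int  # number of log entries already folded into `text`
    text: str


def take_snapshot(log: list[Splice], revision: int) -> Snapshot:
    return Snapshot(revision, replay("", log[:revision]))


def load(snapshot: Snapshot, log: list[Splice]) -> str:
    """Open a document: start from the snapshot, replay only the tail of the log."""
    return replay(snapshot.text, log[snapshot.revision:])


log = [(i, 0, "abcdefghij"[i % 10]) for i in range(2_000)]  # one appended char per op
snap = take_snapshot(log, 1_900)
assert load(snap, log) == replay("", log)  # same document either way
print(len(log) - snap.revision)            # 100 ops replayed instead of 2000
```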

The Connection Layer

Each client maintains a persistent bidirectional connection (historically long-polling, now likely gRPC-web or WebSocket). The server pushes operations to connected clients as they arrive. For a document with 50 concurrent editors, that is 50 open connections to the same document shard. The fan-out is cheap because operations are tiny (a few hundred bytes each), but connection state is the most expensive resource at scale, which is likely one reason Google runs this infrastructure on its own custom serving stack rather than commodity web servers.
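A hypothetical asyncio sketch of that fan-out for one document: each connected client owns an outbound queue, and a committed op is copied into every queue except the sender's. The `DocChannel` name and the message shape are made up for illustration; a real server would also need backpressure for slow clients.

```python
import asyncio


class DocChannel:
    """Hypothetical per-document fan-out: one outbound queue per connected client."""

    def __init__(self) -> None:
        self.queues: dict[str, asyncio.Queue] = {}

    def connect(self, client_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self.queues[client_id] = queue  # this per-connection state is what adds up at scale
        return queue

    def disconnect(self, client_id: str) -> None:
        self.queues.pop(client_id, None)

    def broadcast(self, sender: str, message: dict) -> None:
        for client_id, queue in self.queues.items():
            if client_id != sender:
                queue.put_nowait(message)  # cheap: ops are small


async def main() -> int:
    channel = DocChannel()
    inboxes = {f"user{i}": channel.connect(f"user{i}") for i in range(50)}
    channel.broadcast("user0", {"rev": 1049, "op": ("insert", 247, "a")})
    return sum(q.qsize() for q in inboxes.values())


print(asyncio.run(main()))  # 49
```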

The Offline Experience — What Actually Happens When You Lose Connection

Google Docs handles offline editing, but the experience is deliberately limited compared to online. Here is the exact sequence:

  1. Connection drops. The client detects this (usually within 5-10 seconds via heartbeat failure) and switches to offline mode. A small banner appears: "Trying to connect..."
  2. Local editing continues. You can keep typing. All operations are buffered in the client's local operation queue, stored in the browser's IndexedDB so they survive a tab close or browser crash.
  3. No remote operations arrive. You will not see anyone else's changes until you reconnect. Your document view diverges from the server's truth.
  4. Connection restores. The client sends all buffered operations to the server with their base revision numbers. The server transforms each one against everything that happened while you were offline — potentially hundreds or thousands of operations if other people were actively editing.
  5. Screen "jumps." As the server's transformed operations arrive back, your document may shift — text moves, formatting changes, insertions appear. This is the visual reconciliation. The more edits happened while you were offline, the bigger the jump.
  6. Conflict handling. OT guarantees convergence, but the intent of your edit may not survive. If you edited paragraph 3 offline while someone else deleted paragraph 3 online, OT will converge on a consistent state — but your edit is effectively lost because the paragraph it targeted no longer exists. There is no "merge conflict" dialog like Git. The algorithm just picks the mathematically consistent outcome.
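Here is the paragraph-3 case as a hypothetical Python sketch, simplified to paragraph-level ops where the offline user only edits and the online users only delete (real OT handles every pair). The buffered edit to the third paragraph transforms into nothing, while the edit to the fourth paragraph survives at its new index. No conflict is raised; the lost edit is simply dropped.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EditPara:
    index: int  # which paragraph to replace
    text: str


@dataclass(frozen=True)
class DeletePara:
    index: int


def rebase(op: EditPara, missed: list[DeletePara]) -> Optional[EditPara]:
    """Transform one buffered offline edit past deletions committed while offline."""
    for deletion in missed:               # in the order the server committed them
        if deletion.index == op.index:
            return None                   # target paragraph is gone: the edit is dropped
        if deletion.index < op.index:
            op = replace(op, index=op.index - 1)
    return op


server_doc = ["Intro", "Background", "Method", "Results"]
missed = [DeletePara(0), DeletePara(1)]   # someone deleted "Intro", then "Method"
for d in missed:
    del server_doc[d.index]

buffered = [EditPara(2, "Method (revised)"), EditPara(3, "Results (revised)")]
for op in buffered:                       # replayed on reconnect
    rebased = rebase(op, missed)
    if rebased is not None:
        server_doc[rebased.index] = rebased.text

print(server_doc)  # ['Background', 'Results (revised)']
```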

This is why Google Docs is not truly "offline-first" in the way CRDT-based, local-first apps are. OT with a central server is optimized for low-latency online collaboration, with offline as a graceful degradation rather than a first-class mode. CRDT-based editors handle offline more naturally because they do not need the server to mediate: every replica can independently merge.

How to Build Your Own Collaborative Editor — The Minimum Stack

If you are building a product that needs real-time collaboration (not a Google Docs competitor, but any feature where multiple users edit shared state), here is the practical starting point in 2026:

Option 1: Use Yjs (CRDT-based, recommended for most teams)

  • Yjs — open-source CRDT library for JavaScript. Handles the conflict resolution math.
  • y-websocket — WebSocket provider for syncing Yjs state between clients and a lightweight server.
  • Tiptap or ProseMirror: rich-text editor UI with Yjs bindings (Tiptap's Collaboration extension, or y-prosemirror for plain ProseMirror).
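  • Hocuspocus: open-source Yjs WebSocket backend from the Tiptap team, with hooks for authentication and persistence; Tiptap also offers a hosted version if you do not want to run your own server.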

This stack gets you collaborative rich-text editing with offline support, presence (cursors), and undo/redo in under a week of engineering time. It is what most startups should use unless they have a specific reason to build custom.

Option 2: Use OT (if you need server authority)

  • ShareDB: open-source OT backend for Node.js. Server-authoritative, with the same central-server design as the Jupiter model.
  • rich-text: the OT type for Quill's Delta format, handling text and formatting operations.
  • Quill — rich-text editor with OT bindings.

This stack is harder to set up but gives you full server authority over the document, which matters for applications where the server needs to validate or reject operations (e.g., a code editor that checks syntax, or a form builder with field constraints).

Option 3: Use a Managed Service

  • Liveblocks — managed real-time collaboration infrastructure. Presence, storage, comments. Paid but zero server management.
  • Ably or Pusher — real-time pub/sub for presence and state sync, but you bring your own conflict resolution.

The DevOps Patterns You Can Actually Reuse

Most engineers will never build a collaborative editor. But the patterns Google Docs relies on show up anywhere you have shared real-time state and multiple writers:

  • Optimistic local updates, server reconciliation. Apply the user's action immediately on the client, send it to the server, and rebase against whatever the server did in the meantime. This is how responsive real-time UIs generally work, from Trello cards to Notion blocks.
  • Sequence numbers as convergence anchors. A monotonic server-assigned sequence number is the simplest primitive for "has everyone seen up to this point." Use it for client sync, for replication lag monitoring, and for conflict detection. HTTP ETags with If-Match conditional requests use the same idea.
  • Separate channels by consistency requirement. Real-time systems usually have multiple kinds of state. Put each kind on its own channel with its own consistency model. Operations (strong), presence (eventual), notifications (at-least-once). Mixing them makes everything worse.
  • Linearize through a central authority when you can. Fully decentralized consensus is beautiful and extremely hard. If your problem can tolerate a server, having one linearize the order of events makes every downstream problem an order of magnitude easier. Jupiter's design choice to use a central server is what made Google Docs shippable.
  • Rebase, do not reject. When a client's stale operation arrives, do not fail it — transform it to the current state. Git does this for commits, Google Docs does this for keystrokes. The pattern generalizes: any system where clients make changes based on potentially stale state can use rebase instead of reject for a much better user experience.
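The difference between "reject" and "rebase" fits in a few lines. In this hypothetical Python sketch (the `ListStore` class and its methods are illustrative), a server-assigned version number (the length of the history) anchors every write; a stale insert is either refused, as with an HTTP If-Match precondition, or shifted past the inserts committed since the client's version.

```python
class ListStore:
    """Hypothetical shared list where every committed insert bumps the version."""

    def __init__(self) -> None:
        self.items: list[str] = []
        self.history: list[int] = []  # index of each committed insert

    @property
    def version(self) -> int:
        return len(self.history)  # monotonic, server-assigned sequence number

    def insert_or_reject(self, index: int, item: str, base_version: int) -> bool:
        """Optimistic-concurrency style: refuse any write based on a stale version."""
        if base_version != self.version:
            return False  # client must refetch and retry
        self._commit(index, item)
        return True

    def insert_with_rebase(self, index: int, item: str, base_version: int) -> int:
        """OT style: shift the stale index past every insert committed since base_version."""
        for committed in self.history[base_version:]:
            if committed <= index:
                index += 1
        self._commit(index, item)
        return index

    def _commit(self, index: int, item: str) -> None:
        self.items.insert(index, item)
        self.history.append(index)


store = ListStore()
store.insert_or_reject(0, "buy milk", base_version=0)
# Alice and Bob both last saw version 1: ["buy milk"].
store.insert_or_reject(0, "call mom", base_version=1)          # Alice: accepted
print(store.insert_or_reject(1, "pay rent", base_version=1))   # False (Bob is stale)
store.insert_with_rebase(1, "pay rent", base_version=1)        # Bob: shifted to index 2
print(store.items)  # ['call mom', 'buy milk', 'pay rent']
```

Bob asked for "pay rent" right after "buy milk", and that is where it lands, without a round trip to refetch the list.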

Frequently Asked Questions

How does Google Docs real-time collaboration work?

Google Docs uses an algorithm called Operational Transform (OT) to synchronize edits across users. When you type, your edit is applied immediately on your screen (optimistic update) and sent to a central server. The server transforms your edit against any concurrent edits from other users, applies the transformed version to the canonical document, and broadcasts the result to everyone. Every user converges on the same document without anyone losing changes. The specific OT variant Google uses is called the Jupiter model, developed at Xerox PARC in 1995.

How does Google Docs handle two people typing at once?

Operational Transform. Each keystroke is an operation sent to a central server. The server transforms concurrent operations against each other using a mathematical function that preserves user intent — if Alice inserts at position 10 and Bob deletes at position 5, Bob's delete shifts Alice's insert to account for the position change. Everyone converges on the same final document, regardless of the order the operations arrive at the server.

What is Operational Transform?

A family of algorithms for synchronizing changes to a shared data structure across multiple concurrent editors. The key mathematical property (TP1) is: if you apply operation A first and then a transformed B, or operation B first and then a transformed A, you get the same final state. This guarantee is what makes OT correct: for any two concurrent operations, it does not matter which one is applied first, and with a central server fixing the order of everything else, every client ends up with the same document.

What is the difference between OT and CRDTs?

OT (Operational Transform) uses a central server to sequence and transform operations. CRDTs (Conflict-free Replicated Data Types) embed enough metadata in each operation that any two replicas can independently merge without a server. OT is simpler when you have a server and users are mostly online. CRDTs are better for offline-first and peer-to-peer scenarios. Google Docs uses OT. Linear and (partially) Notion use CRDTs, and Figma's multiplayer engine is CRDT-inspired. Both produce correct results; they are different architectural trade-offs, not "right vs wrong."

Does Google Docs use CRDTs?

No. Google Docs uses Operational Transform, specifically the Jupiter model (server-centric OT). CRDTs are a newer alternative, implemented by libraries such as Automerge and Yjs and used in products such as Linear. Google has not migrated because Jupiter-model OT has been working reliably for over 15 years, and the cost of rewriting the algorithm stack would outweigh the marginal benefit. For new systems in 2026, CRDTs (via Yjs) are often the recommended choice.

How does Google Docs show cursors in real time?

Through a separate presence channel that runs alongside the OT operation channel. Cursor and selection updates do not need convergence guarantees — they are ephemeral pub/sub messages that the server forwards to everyone viewing the document. If a cursor update arrives late or out of order, the newest one simply overwrites the old one. This separation of "document state" (OT, strong consistency) from "presence state" (pub/sub, eventual consistency) is a key architectural pattern.

What happens if I edit a Google Doc offline?

The client buffers all your operations locally in IndexedDB and applies them optimistically to your local view. When you reconnect, the client sends all buffered operations to the server with their base revision numbers. The server transforms each one against everything that happened while you were offline. Your screen may "jump" as the reconciliation replays — the more edits others made while you were offline, the bigger the jump. Unlike CRDT-based editors, Google Docs treats offline as graceful degradation rather than a first-class mode.

Can 100 people edit a Google Doc at the same time?

Yes. Google Docs officially supports up to 100 simultaneous editors per document. The server handles this by transforming each incoming operation against the operation log. With 100 editors that is a lot of concurrent transforms, but the operations are tiny (a few hundred bytes each) and the transform functions are computationally cheap. The likelier bottleneck is fan-out (broadcasting each operation to 99 other clients), not the transform itself, and the presence channel, with its constant stream of cursor updates, is likely to feel the strain before the OT channel does.

How does Google Docs handle conflicts?

There are no "conflicts" in the Git merge-conflict sense. OT guarantees mathematical convergence — every pair of concurrent operations has a defined transform that produces the same result regardless of application order. However, intent can be lost: if you edit paragraph 3 while someone else deletes it, OT converges on a consistent state (the paragraph is deleted and your edit is gone), but there is no "resolve conflict" dialog. The algorithm picks the mathematically correct outcome, which may not be what either user expected. This is a known limitation of all OT systems.

Is Google Docs peer-to-peer?

No. Google Docs uses a server-centric architecture where a central Google server is the single source of truth for each document. All operations flow through the server, and the server defines the canonical ordering. This is a deliberate design choice — having one authority simplifies the OT algorithm dramatically (Jupiter model) compared to fully decentralized OT or peer-to-peer CRDTs. The trade-off is that you need an internet connection to the Google server for real-time collaboration to work.

How would I build something like Google Docs?

For most teams in 2026, the recommended stack is Yjs (open-source CRDT library) + Tiptap or ProseMirror (rich-text editor) + y-websocket (sync server). This gives you collaborative editing with offline support, presence cursors, and undo/redo in under a week. If you specifically need server authority (validating or rejecting operations), use ShareDB (OT library for Node.js) + Quill (editor). For zero server management, use Liveblocks (managed collaboration infrastructure).


Akshay Ghalme

AWS DevOps Engineer with 3+ years building production cloud infrastructure. AWS Certified Solutions Architect. Currently managing a multi-tenant SaaS platform serving 1000+ customers.
