
Erasure Coding Simplified


Let’s be honest — most explanations of erasure coding feel like something ripped out of a PhD thesis.

But here’s the thing: if you’re working with distributed systems, cloud storage, or building anything remotely fault-tolerant at scale, you need to know what erasure coding is, why it exists, and when to use it.

In this post, I’ll break it down — no jargon, just real-world analogies, and a fully worked-out example so you won’t forget it.

💡 First, What Is Erasure Coding?

At its core, erasure coding is a data protection technique.
It’s an alternative to simple replication.

Instead of making multiple copies of your data (which is wasteful), it:

  • Breaks data into pieces
  • Adds some extra parity pieces
  • And can reconstruct the full original even if some parts are missing

If you’ve ever heard of RAID 6, S3 Glacier, HDFS EC, or Reed-Solomon, you’ve already seen erasure coding in action.

🤯 Let’s Take a Real Example

Let’s say we want to store a 60MB file across a distributed system.
We use a coding scheme: k = 6, m = 3

That means:

  • We split the file into 6 data chunks
  • We add 3 parity chunks
  • Total chunks stored = 9
  • The system can recover the file from any 6 out of 9 chunks

Here’s how it looks:

Original file:  60MB
Data chunks:    D1 | D2 | D3 | D4 | D5 | D6   (10MB each)
Parity chunks:  P1 | P2 | P3                  (10MB each)
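If you like seeing it in code, here's a minimal Python sketch of just the splitting step. It's not a real EC library — the split_into_chunks helper, the zero-padding, and the stand-in 60MB payload are all illustrative assumptions:

```python
def split_into_chunks(data, k):
    """Split data into k equal-sized chunks, zero-padding the tail if needed."""
    chunk_size = -(-len(data) // k)               # ceiling division
    padded = data.ljust(chunk_size * k, b"\x00")  # pad so it divides evenly
    return [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]

file_bytes = b"\x01" * (60 * 1024 * 1024)         # stand-in for the 60MB file
data_chunks = split_into_chunks(file_bytes, k=6)  # D1..D6
print(len(data_chunks), len(data_chunks[0]))      # 6 chunks of 10485760 bytes (10MB)
```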

Now let’s say 2 disks fail, and we lose chunks D4 and P2.

No problem. We still have:

  • D1, D2, D3, D5, D6
  • P1, P3

That’s 7 chunks — and we only need 6. ✅ We can reconstruct D4 from the remaining data using the parity math.

Each parity chunk is a linear combination of the data chunks. Think of it like this (numbers simplified for understanding):

P1 = D1 + D2 + D3 + D4 + D5 + D6  
P2 = 2×D1 + 3×D2 + 4×D3 + 5×D4 + 6×D5 + 7×D6  
P3 = 3×D1 + 5×D2 + 6×D3 + 7×D4 + 8×D5 + 9×D6

(Actual coefficients are from a matrix over finite fields like GF(2^8), but we’ll keep numbers readable here.)

Each parity chunk gives us one equation, so together they form a system of linear equations.
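To make that concrete, here's a toy Python version of those three equations. Each "chunk" is a single made-up integer rather than a block of bytes (D1..D6 = 12, 7, 9, 10, 15, 7 are picked to match the numbers used later), and the coefficients are the simplified ones above, not real GF(2^8) values:

```python
# Toy values for D1..D6, each chunk treated as a single integer.
D = [12, 7, 9, 10, 15, 7]

# Simplified coefficient rows for P1, P2, P3 (real codes use GF(2^8) matrices).
coefficients = [
    [1, 1, 1, 1, 1, 1],   # P1
    [2, 3, 4, 5, 6, 7],   # P2
    [3, 5, 6, 7, 8, 9],   # P3
]

parity = [sum(c * d for c, d in zip(row, D)) for row in coefficients]
print(parity)  # [60, 270, 378] -> P1, P2, P3
```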

Now, since we know the values of:

  • D1, D2, D3, D5, D6
  • P1 and P3

We can plug these values into the equations for P1 and P3.

Let’s say (again simplified):

P1 = D1 + D2 + D3 + D4 + D5 + D6
P3 = 3×D1 + 5×D2 + 6×D3 + 7×D4 + 8×D5 + 9×D6

Now:

  • We know the values of D1, D2, D3, D5, D6
  • We know the values of P1 and P3
  • Only D4 is unknown

This is now just a system of two equations with a single unknown (D4); in fact, either equation alone is enough to solve for it.

You solve the equations like high-school algebra:

Example:

Let’s pretend we calculated:

P1 = 60  
D1 + D2 + D3 + D5 + D6 = 50
=> D4 = P1 - 50 = 10

Done. You’ve reconstructed D4!
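Here's the same recovery as a few lines of Python, using the toy integers from the sketch above. Only the P1 equation is needed because there's a single unknown:

```python
# Surviving chunks after losing D4 and P2 (same toy values as before).
known = {"D1": 12, "D2": 7, "D3": 9, "D5": 15, "D6": 7}
P1 = 60

# P1 = D1 + D2 + D3 + D4 + D5 + D6  =>  D4 = P1 - (sum of the survivors)
D4 = P1 - sum(known.values())
print(D4)  # 10
```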

⚠️ Note: In practice, this is done using matrix inversion over a Galois Field, not simple subtraction, but the concept is identical: use parity equations and solve for the unknowns.

🔧 How the Parity Works (Simplified)

Under the hood, it’s math.
Not simple XOR parity (like RAID 5), but Reed-Solomon codes built on finite-field algebra (arithmetic over GF(2^8), which behaves a lot like modular arithmetic).

You don’t need to know the math to use it, but you should know this:

  • Parity isn’t just "copies" — it’s calculated
  • Rebuilding missing data is deterministic and lossless
  • The encoding and decoding cost CPU cycles
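If you just want to try encode and decode without touching the math, one option from Python is the third-party reedsolo package (pip install reedsolo). Treat this as a sketch rather than a reference — the exact shape of the decode() return value differs between reedsolo versions:

```python
from reedsolo import RSCodec

rsc = RSCodec(10)                            # 10 parity (ECC) bytes per message
encoded = rsc.encode(b"erasure coding demo")

# Flip a couple of bytes, then recover the original deterministically.
damaged = bytearray(encoded)
damaged[0] ^= 0xFF
damaged[5] ^= 0xFF

decoded, _, _ = rsc.decode(damaged)          # recent versions return a 3-tuple
print(bytes(decoded))                        # b'erasure coding demo'
```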

⚖️ Erasure Coding vs Replication

| Feature | Replication (e.g., 3x) | Erasure Coding (e.g., 6+3) |
|---|---|---|
| Storage overhead | 200% (store 3x the data) | ~50% (store 1.5x the data) |
| Durability | High (depends on number of copies) | Higher (can survive multiple losses) |
| Repair cost | Low (just copy again) | Higher (must recalculate missing data) |
| Bandwidth use | High (for large copies) | Lower |
| Performance | Faster (for hot data) | Slower (better suited to cold, archival data) |
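To put numbers on it with the 60MB file from earlier: 3x replication stores 180MB and tolerates the loss of any two copies, while the 6+3 scheme stores 90MB (nine 10MB chunks) and tolerates the loss of any three chunks.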

🚀 Where Erasure Coding Shines

  • Cloud storage systems like Amazon S3, Azure Blob Storage, and Google Cloud Storage
  • Archival tiers (e.g., S3 Glacier) where cost-saving matters
  • Distributed file systems like HDFS, Ceph, MinIO
  • Large-scale backup systems

It’s especially useful when you have to store petabytes of data and can’t afford 3x duplication.

🧠 When You Shouldn’t Use It

  • For hot data that’s accessed and written frequently (high CPU overhead)
  • When latency matters more than cost (e.g., user-facing APIs)
  • For small data sets where replication is simpler and good enough
  • In systems with low CPU and memory capacity (like edge devices)

🧰 Real-World Systems Using Erasure Coding

| System | Purpose | EC Role |
|---|---|---|
| Amazon S3 | Object storage | Uses EC for cost-efficient tiers |
| HDFS | Big data storage | Optional EC instead of replication |
| Ceph | Distributed file/object/block storage | Native EC pool support |
| MinIO | S3-compatible object storage | Uses EC for redundancy |
| RAID 6 | Local disk redundancy | Uses Reed-Solomon based EC |

🔚 Conclusion

Erasure coding is one of those things that feels intimidating at first — but once you break it down, it’s just math that saves storage and adds durability.

If you’re designing systems at scale — especially anything distributed, cloud-native, or storage-heavy — you owe it to yourself to understand this concept.

Replication is great for simplicity.
But erasure coding is what the hyperscalers use to win at cost, durability, and scale.

Next time someone asks, “Why not just make 3 copies of everything?”, you’ll have a pretty solid answer.

Until next time — keep learning the stuff that matters.

The Dev Learnings Team