Erasure Coding Simplified
Let’s be honest — most explanations of erasure coding feel like something ripped out of a PhD thesis.
But here’s the thing: if you’re working with distributed systems, cloud storage, or building anything remotely fault-tolerant at scale, you need to know what erasure coding is, why it exists, and when to use it.
In this post, I’ll break it down — no jargon, just real-world analogies, and a fully worked-out example so you won’t forget it.
💡 First, What Is Erasure Coding?
At its core, erasure coding is a data protection technique.
It’s an alternative to simple replication.
Instead of making multiple copies of your data (which is wasteful), it:
- Breaks data into pieces
- Adds some extra parity pieces
- And can reconstruct the full original even if some parts are missing
If you’ve ever heard of RAID 6, S3 Glacier, HDFS EC, or Reed-Solomon, you’ve already seen erasure coding in action.
🤯 Let’s Take a Real Example
Let’s say we want to store a 60MB file across a distributed system.
We use a coding scheme: k = 6, m = 3
That means:
- We split the file into 6 data chunks
- We add 3 parity chunks
- Total chunks stored = 9
- The system can recover the file from any 6 out of 9 chunks
Here’s how it looks:
Original File: 60MB
Chunks: D1 | D2 | D3 | D4 | D5 | D6
Parity: P1 | P2 | P3
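If you want to picture the splitting step in code, here's a minimal Python sketch. A small byte string stands in for the 60MB file, and the helper name and zero-padding are just illustrative choices, not how any particular system does it:

```python
def split_into_chunks(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size chunks, zero-padding the tail."""
    chunk_size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(k * chunk_size, b"\x00")
    return [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]


file_bytes = b"a small payload standing in for the 60MB file in the example"
data_chunks = split_into_chunks(file_bytes, k=6)  # D1..D6
print([len(c) for c in data_chunks])              # six equal-size chunks
```

The parity chunks P1, P2, P3 are then computed from these six data chunks, which is what the next section walks through.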
Now let’s say 2 disks fail, and we lose chunks D4 and P2.
No problem. We still have:
- D1, D2, D3, D5, D6
- P1, P3
That’s 7 chunks — and we only need 6. ✅ We can reconstruct D4 from the remaining data using the parity math.
Each parity chunk is a linear combination of the data chunks. Think of it like this (numbers simplified for understanding):
P1 = D1 + D2 + D3 + D4 + D5 + D6
P2 = 2×D1 + 3×D2 + 4×D3 + 5×D4 + 6×D5 + 7×D6
P3 = 3×D1 + 5×D2 + 6×D3 + 7×D4 + 8×D5 + 9×D6
(Actual coefficients are from a matrix over finite fields like GF(2^8), but we’ll keep numbers readable here.)
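To make those equations concrete, here's a tiny sketch that computes the three parity values from six toy chunk values using the simplified integer coefficients above. Real implementations do this byte by byte over GF(2^8); the toy values are picked so that P1 comes out to 60, matching the worked numbers later in the post.

```python
# Toy parity math using the simplified integer coefficients above
# (real Reed-Solomon does this byte by byte over GF(2^8)).
D = [12, 9, 14, 10, 8, 7]        # stand-ins for D1..D6, chosen so P1 = 60

coefficients = [
    [1, 1, 1, 1, 1, 1],          # P1
    [2, 3, 4, 5, 6, 7],          # P2
    [3, 5, 6, 7, 8, 9],          # P3
]

P1, P2, P3 = (sum(c * d for c, d in zip(row, D)) for row in coefficients)
print(P1, P2, P3)                # 60 254 362
```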
Each parity chunk gives us one equation in a system of linear equations.
Now, since we know the values of:
- D1, D2, D3, D5, D6
- P1 and P3
We can plug these values into the equations for P1 and P3.
Let’s say (again simplified):
P1 = D1 + D2 + D3 + D4 + D5 + D6
P3 = 3×D1 + 5×D2 + 6×D3 + 7×D4 + 8×D5 + 9×D6
Now:
- We know the values of D1, D2, D3, D5, D6
- We know the values of P1 and P3
- Only D4 is unknown
That leaves two equations but only a single unknown (D4), so either equation on its own is enough to pin it down. (If we had lost two data chunks instead, say D4 and D5, we'd have two unknowns and would need both equations.)
You solve the equations like high-school algebra:
Example:
Let’s pretend we calculated:
P1 = 60
D1 + D2 + D3 + D5 + D6 = 50
=> D4 = P1 - 50 = 10
Done. You’ve reconstructed D4!
⚠️ Note: In practice, this is done using matrix inversion over a Galois Field, not simple subtraction, but the concept is identical: use parity equations and solve for the unknowns.
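Continuing with the same toy numbers from the earlier sketch, the recovery of D4 really is just that subtraction:

```python
# Same toy numbers as above: D4 (value 10) and P2 are lost.
D = [12, 9, 14, 10, 8, 7]                    # original D1..D6
P1 = sum(D)                                  # 60, computed before the failure

survivors = [d for i, d in enumerate(D) if i != 3]   # D1, D2, D3, D5, D6
recovered_D4 = P1 - sum(survivors)           # 60 - 50 = 10
assert recovered_D4 == D[3]
```

Had two data chunks gone missing, the same idea applies with two equations and two unknowns, which is the small linear system a real decoder solves with matrix inversion over the Galois field.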
🔧 How the Parity Works (Simplified)
Under the hood, it’s math.
Not simple XOR parity like RAID 5, but Reed-Solomon codes built on finite-field (Galois-field) algebra, typically over GF(2^8), where addition is a bytewise XOR and multiplication is polynomial arithmetic modulo a fixed irreducible polynomial.
You don’t need to know the math to use it, but you should know this:
- Parity isn’t just "copies" — it’s calculated
- Rebuilding missing data is deterministic and lossless
- The encoding and decoding cost CPU cycles
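As a tiny illustration of those three points, here's the simplest possible calculated parity: a bytewise XOR, which happens to be exactly what addition means in GF(2^8). This is not Reed-Solomon, and it can only rebuild a single missing chunk, but it shows parity being computed rather than copied, and the rebuild being deterministic and lossless:

```python
# The simplest "calculated" parity: bytewise XOR, which is what addition
# means in GF(2^8). Plain XOR parity can rebuild only ONE missing chunk;
# Reed-Solomon adds finite-field multiplication so m parity chunks can
# repair up to m losses.
def xor_parity(chunks: list[bytes]) -> bytes:
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

chunks = [b"\x01\x02", b"\x0a\x0b", b"\xf0\x0d"]
parity = xor_parity(chunks)                            # calculated, not copied
rebuilt = xor_parity([chunks[0], chunks[2], parity])   # chunks[1] was lost
assert rebuilt == chunks[1]                            # deterministic, lossless
```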
⚖️ Pros vs Replication
| Feature | Replication (e.g., 3x) | Erasure Coding (e.g., 6+3) |
|---|---|---|
| Storage Overhead | 200% (store 3x the data) | ~50% (store 1.5x the data) |
| Durability | High (depends on the number of copies) | Higher (survives multiple simultaneous failures) |
| Repair Cost | Low (just copy again) | Higher (missing chunks must be recomputed from survivors) |
| Bandwidth Use | High (3x the data crosses the network) | Lower (~1.5x the data) |
| Performance | Faster (good for hot data) | Slower (better suited to cold, archival data) |
🚀 Where Erasure Coding Shines
- Cloud storage systems like Amazon S3, Azure Blob Storage, and Google Cloud Storage
- Archival tiers (e.g., S3 Glacier) where cost-saving matters
- Distributed file systems like HDFS, Ceph, MinIO
- Large-scale backup systems
It’s especially useful when you have to store petabytes of data and can’t afford 3x duplication.
🧠 When You Shouldn’t Use It
- For hot data that’s accessed and written frequently (high CPU overhead)
- When latency matters more than cost (e.g., user-facing APIs)
- For small data sets where replication is simpler and good enough
- In systems with low CPU and memory capacity (like edge devices)
🧰 Real-World Systems Using Erasure Coding
| System | Purpose | EC Role |
|---|---|---|
| Amazon S3 | Object storage | Uses EC for cost-efficient storage tiers |
| HDFS | Big data storage | Optional EC mode instead of 3x replication |
| Ceph | Distributed object/block/file storage | Native EC pool support |
| MinIO | S3-compatible object storage | Uses EC for redundancy |
| RAID 6 | Local disk redundancy | Reed-Solomon-style dual parity |
🔚 Conclusion
Erasure coding is one of those things that feels intimidating at first — but once you break it down, it’s just math that saves storage and adds durability.
If you’re designing systems at scale — especially anything distributed, cloud-native, or storage-heavy — you owe it to yourself to understand this concept.
Replication is great for simplicity.
But erasure coding is what the hyperscalers use to win at cost, durability, and scale.
Next time someone asks, “Why not just make 3 copies of everything?”, you’ll have a pretty solid answer.
Until next time — keep learning the stuff that matters.
— The Dev Learnings Team