The Silent Database Killer: Why Your Transactions Shouldn't Wait on the Network
In this edition of Real-World Engineering, we aren’t looking at a textbook problem. We are looking at a "production-is-on-fire" problem.
I’ve spent 14 years building distributed systems and moving bits across clouds, and if there is one thing I’ve learned, it’s this: The network is a liar, and your database is a jealous lover.
Yesterday, a junior engineer on my team (talented, but still earning his scars on high-scale systems) came to me with a puzzle.
The Situation: The "Safe" Transaction
We had a service that needed to process a user’s premium upgrade. The requirements were simple:
- Update the user's status in the DB.
- Call a 3rd party Billing API to charge the card.
- If the charge fails, revert the DB change.
The junior, wanting to be responsible, wrapped the whole thing in a single @Transactional block.
The Logic:
- Open Transaction.
- Update is_premium = true.
- Heavy Network Call to Billing Provider.
- If success → Commit.
- If error → Rollback.
The Problem? The system started crawling. Connection pools were exhausted. The database CPU was spiking, but there was almost no traffic.
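To make the failure mode concrete, here is a minimal Java sketch of that flow. It is not the team's actual code: `NaiveUpgrade` and `TracingTx` are hypothetical stand-ins (no Spring, no real database), and the billing call is just a `BooleanSupplier`. The event log only records the order of operations, and the order is the problem: the billing call lands between BEGIN and COMMIT, so the connection stays checked out for the whole network round trip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

class NaiveUpgrade {
    // Anti-pattern: chargeCard (the network call) runs while the transaction is open.
    static boolean upgrade(TracingTx tx, BooleanSupplier chargeCard) {
        tx.begin();
        tx.update();
        tx.events.add("BILLING_CALL");
        boolean charged;
        try {
            charged = chargeCard.getAsBoolean();
        } catch (RuntimeException e) {
            charged = false; // treat a network failure like a declined charge
        }
        if (charged) tx.commit(); else tx.rollback();
        return charged;
    }

    public static void main(String[] args) {
        TracingTx tx = new TracingTx();
        upgrade(tx, () -> true);
        System.out.println(tx.events); // [BEGIN, UPDATE, BILLING_CALL, COMMIT]
    }
}

// Hypothetical stand-in for a transaction bound to one pooled connection.
class TracingTx {
    final List<String> events = new ArrayList<>();

    void begin()    { events.add("BEGIN"); }    // connection checked out of the pool
    void update()   { events.add("UPDATE"); }   // is_premium = true, row now locked
    void commit()   { events.add("COMMIT"); }   // connection returned to the pool
    void rollback() { events.add("ROLLBACK"); } // connection returned to the pool
}
```

Everything between BEGIN and COMMIT is time the pool cannot hand that connection to anyone else, no matter how little work the database itself is doing.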
How the Junior Approached It
When we sat down, he showed me the code with a look of pure confusion.
"I’m using a transaction to ensure data integrity," he said. "If the billing fails, I don't want the user to have premium status for free. It has to be one atomic operation."
His mental model was correct from a business perspective, but dangerous from a system perspective.
He was treating a distributed system like a local monolith. He thought he was being safe, but he was actually holding a "lock" on a database row while waiting for a packet to travel across the Atlantic, wait for a 3rd party server to wake up, and travel back.
How I Explained It: The Restaurant Analogy
I told him: "Imagine you go to a busy restaurant. You sit at a table (The DB Row), and the waiter (The Transaction) takes your order. But instead of going to the kitchen, the waiter stands at your table and calls the vegetable supplier to see if they have carrots."
"While he’s on the phone for 10 minutes, he can't serve anyone else. The table is occupied. The line outside gets longer. Eventually, the restaurant goes out of business because every waiter is just standing at a table holding a phone."
The Engineering Reality:
- Database connections are finite.
- When you start a transaction, you hold a connection.
- If you make a network call inside that transaction, that connection sits idle but "active."
- If the network call takes 2 seconds (not uncommon for 3rd party APIs), and you have 50 concurrent users, you’ve just locked up 50 DB connections for 2 seconds each.
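The arithmetic behind those bullets is worth doing once. As a rough sketch, assume a pool of 50 connections (a hypothetical size; the story above only gives the number of concurrent users) and compare a 2-second hold time with a 5-millisecond one. `PoolMath` is an illustrative helper, not part of any library.

```java
class PoolMath {
    // Upper bound on transactions per second for a pool:
    // each connection can complete at most (1000 / holdMillis) transactions per second.
    static double maxTransactionsPerSecond(int poolSize, long holdMillis) {
        return poolSize * 1000.0 / holdMillis;
    }

    public static void main(String[] args) {
        // Transaction held open across a 2-second billing call.
        System.out.println(maxTransactionsPerSecond(50, 2000)); // 25.0
        // Transaction that only touches the database for about 5 ms.
        System.out.println(maxTransactionsPerSecond(50, 5));    // 10000.0
    }
}
```

Under these assumptions the same pool goes from a ceiling of 25 upgrades per second to 10,000, purely by changing how long each transaction holds its connection.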
How We Resolved It: The Outbox Pattern (or Post-Commit Logic)
We didn't need a database transaction to span the network. We needed Eventual Consistency.
We refactored the logic:
- Update the DB first with a "Pending" status.
- Commit the transaction immediately (releasing the connection).
- Make the network call outside the transaction.
- Update the DB again based on the result.
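If the network call fails or the system crashes before step 4? A background worker picks up the rows still stuck in "Pending" and retries the call or reconciles the state. In a full Outbox Pattern, the "charge this user" message is written to an outbox table in the same transaction as the status update, so the worker always has a durable record to act on.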
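The four steps and the reconciliation pass can be sketched in plain Java. This is a simplified illustration under stated assumptions, not our production code: `PendingUpgradeFlow` is a hypothetical class, the `users` map stands in for the users table (each `put` represents one short, committed transaction), and `chargeCard` stands in for the billing client. It also assumes the billing call is idempotent, for example via an idempotency key, so that a retry cannot charge the card twice.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

class PendingUpgradeFlow {
    enum Status { FREE, PENDING, PREMIUM }

    // In-memory stand-in for the users table.
    final Map<String, Status> users = new ConcurrentHashMap<>();

    boolean upgrade(String userId, Predicate<String> chargeCard) {
        users.put(userId, Status.PENDING);                 // steps 1-2: write PENDING, commit
        boolean charged = chargeCard.test(userId);         // step 3: network call, no transaction open
        users.put(userId, charged ? Status.PREMIUM : Status.FREE); // step 4: second short transaction
        return charged;
    }

    // One pass of the background worker: retry every row still PENDING.
    int reconcile(Predicate<String> chargeCard) {
        int resolved = 0;
        for (String userId : users.keySet()) {
            if (users.get(userId) != Status.PENDING) continue;
            try {
                boolean charged = chargeCard.test(userId);
                users.put(userId, charged ? Status.PREMIUM : Status.FREE);
                resolved++;
            } catch (RuntimeException e) {
                // Billing still unreachable: leave the row PENDING for the next pass.
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        PendingUpgradeFlow flow = new PendingUpgradeFlow();
        flow.upgrade("alice", id -> true);
        try {
            flow.upgrade("bob", id -> { throw new RuntimeException("billing timeout"); });
        } catch (RuntimeException e) {
            // The failure happened after the PENDING commit, so bob's row is still PENDING.
        }
        System.out.println(flow.users.get("bob")); // PENDING
        flow.reconcile(id -> true);
        System.out.println(flow.users.get("bob")); // PREMIUM
    }
}
```

The point of the design is that a failure at any moment leaves a visible "Pending" row behind instead of an open transaction, and a visible row is something a worker can find and fix.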
How Not to Make This Mistake
If you take anything away from my 14 years of breaking things, let it be these three rules:
1. The Golden Rule
Never, ever, perform an I/O operation (Network, File System, External API) inside a Database Transaction.
2. Keep Transactions Short
A transaction should be "Get in, update bits, get out." It should be measured in milliseconds, not seconds.
3. Embrace "Pending" States
Instead of trying to make everything happen "now," design your state machine to handle "In-Progress" states. It makes your system resilient to network flickers.
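One way to make "In-Progress" states explicit is to write the allowed transitions down in code. The sketch below is illustrative only: the state names (`FREE`, `PENDING`, `PREMIUM`, `FAILED`) and the transition table are assumptions chosen for this upgrade example, not a prescribed design.

```java
import java.util.EnumSet;
import java.util.Set;

class UpgradeStateDemo {
    public static void main(String[] args) {
        System.out.println(UpgradeState.FREE.canMoveTo(UpgradeState.PENDING)); // true
        System.out.println(UpgradeState.FREE.canMoveTo(UpgradeState.PREMIUM)); // false
    }
}

enum UpgradeState {
    FREE, PENDING, PREMIUM, FAILED;

    // States reachable from this one in a single step.
    Set<UpgradeState> next() {
        switch (this) {
            case FREE:    return EnumSet.of(PENDING);          // upgrade requested
            case PENDING: return EnumSet.of(PREMIUM, FAILED);  // billing answered
            case FAILED:  return EnumSet.of(PENDING);          // user or worker retries
            default:      return EnumSet.noneOf(UpgradeState.class); // PREMIUM is terminal here
        }
    }

    boolean canMoveTo(UpgradeState target) {
        return next().contains(target);
    }
}
```

With a guard like `canMoveTo`, nobody can jump straight from free to premium without passing through the pending state that the background worker knows how to reconcile.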
Final Thoughts
My junior engineer didn't just fix a bug; he changed how he thinks about distributed time. In a local function, time is cheap. In a distributed system, time is the most expensive resource you have.
Don't let your database wait on the internet. The internet doesn't care about your connection pool.
I'll see you in the next one.
Happy Coding.