Look Beyond the Obvious


A few weeks back, I was working with a senior engineer to set up chaos testing experiments. For those unfamiliar with chaos testing, it’s a way of testing distributed systems by intentionally breaking things to uncover hidden weaknesses. The goal was to use an open-source tool called Litmus.

We integrated Litmus with our Kubernetes cluster and ran several experiments, including pod deletion. One of the best things about Litmus is that it generates detailed reports on the success or failure of each experiment, which we surfaced on our Datadog dashboard.

However, something seemed off. According to the dashboard, the experiment hadn't run as expected: we had scheduled it to execute five times, but the reports didn't reflect that. My colleague tweaked the configuration to try to fix it, but nothing there looked wrong. I spent some time reviewing the settings myself and couldn't find anything either.
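For context, the number of pod kills in a Litmus pod-delete experiment is typically driven by the `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` environment variables rather than an explicit run count. A minimal ChaosEngine sketch along those lines (the names, namespace, and labels here are hypothetical, not our actual setup):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-pod-delete        # hypothetical engine name
  namespace: demo              # hypothetical namespace
spec:
  appinfo:
    appns: demo
    applabel: "app=demo-service"   # hypothetical target label
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # run for 100s, killing a pod every 20s,
            # i.e. roughly five kills over the experiment
            - name: TOTAL_CHAOS_DURATION
              value: "100"
            - name: CHAOS_INTERVAL
              value: "20"
            - name: FORCE
              value: "false"
```

With a configuration like this, "five runs" is an emergent property of duration divided by interval, which is one reason a dashboard count can disagree with what you think you scheduled.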

Then, I decided to check the pod status reports during the experiment runtime—and voilà, I noticed that the pods were indeed getting killed at the expected intervals. This meant the experiment was actually running fine, but there was probably a reporting issue causing the confusion.
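Checking pod status directly, rather than trusting the dashboard, can be done with plain kubectl. A sketch of what that looks like (namespace and label are hypothetical, and these commands assume access to the target cluster):

```
# Watch the target pods live while the experiment runs;
# kills show up as pods terminating and being recreated
kubectl get pods -n demo -l app=demo-service -w

# Afterwards, review recent events in time order to confirm
# pods were killed at the expected intervals
kubectl get events -n demo --sort-by=.lastTimestamp
```

Seeing the kills in the raw pod events while the report claimed otherwise is what pointed us at a reporting problem rather than an experiment problem.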

This experience reinforced an important lesson:
We all make mistakes, and sometimes we don't dig deep enough to truly understand a problem. It's worth putting in the effort to diagnose the real issue. Sometimes it isn't a system failure at all, but a misinterpretation of the data, as it was in our case.