How a Single Character Nearly Broke Everything
It often takes no more than a single line of code to break everything.
We often talk much more about our successes than our failures. Paradoxically, failures are often the experiences that leave the deepest mark on us and allow us to grow.
I've decided to openly share my failures so that others around me can learn from them.
Today, I'm going to tell you about the day when I caused the slow death of a service, resulting in an incident that could have jeopardized the launch of a new feature of the product I was working on.
We were nearing the end of the development cycle for an initiative our team had been working on for months. I remember it like it was yesterday. The team was proud of how far we had come: everything had been developed, tested by internal testers, and the feedback was rather positive. We had just finished migrating one of the oldest modules of the product 🤘.
To give a bit more context: the existing functionality had been built on the monolith, with its data living in the main, central database. This migration wasn't only about improving the product; it was also a step towards a more microservices-oriented architecture, with our own service and a dedicated database tailored to our needs.
So we had developed our own endpoints in this new service, separate from the existing ones on the monolith.
Everything was kept behind a feature flag throughout the development and testing period. Once we deemed the functionality mature enough to open to users, we needed to run a migration script to copy the existing data from the main database to the one dedicated to the new service. We had planned this step for a long time and tested the script beforehand, and everything was fine. We also ran it in dry-run mode the evening before, to check whether it had any impact on the rest of the platform (it still made fairly heavy requests to the main database), and everything went smoothly. So we were confident enough to run it even during platform usage hours.
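To give an idea of what such a script can look like, here is a minimal sketch with a dry-run switch; every name, model, and environment variable below is hypothetical, and the real script was more involved:

```typescript
// Hypothetical sketch: copy teams from the main database to the new service's
// database, in batches, with a dry-run mode that performs no writes.
import { PrismaClient } from "@prisma/client";

const DRY_RUN = process.env.DRY_RUN === "true";
// Simplification for the example: both databases are reached with the same
// generated client, pointed at different connection strings.
const mainDb = new PrismaClient({ datasources: { db: { url: process.env.MAIN_DATABASE_URL } } });
const teamsDb = new PrismaClient({ datasources: { db: { url: process.env.TEAMS_DATABASE_URL } } });

async function copyTeams() {
  const batchSize = 1000;
  for (let skip = 0; ; skip += batchSize) {
    // Batched reads keep the load on the main database bounded.
    const rows = await mainDb.team.findMany({ skip, take: batchSize });
    if (rows.length === 0) break;

    if (DRY_RUN) {
      console.log(`[dry-run] would copy ${rows.length} teams (offset ${skip})`);
      continue;
    }
    await teamsDb.team.createMany({ data: rows, skipDuplicates: true });
  }
}

copyTeams()
  .catch((err) => { console.error(err); process.exitCode = 1; })
  .finally(() => Promise.all([mainDb.$disconnect(), teamsDb.$disconnect()]));
```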
On D-day, we launched the database migration script and opened the feature to a few test accounts and early adopters. We had a dashboard to track progress, and everything was green, with no apparent errors. There was really no reason to worry … until, a few minutes later, while the script was still running, our team was alerted by the SREs (Site Reliability Engineers), who had detected something abnormal on one of our services: it was in CrashLoopBackOff (crashing almost every time it started) due to an OOM (out of memory) error. The pods were running out of memory and crashing one after the other. Since this service was used by other features our team maintained, those features were no longer accessible.
We knew that we were running a fairly heavy migration script and that it was the only thing that could have such a big impact. And even though we didn't see any connection between the migration and the downed service (because we were writing directly to the target database), we still decided to stop the migration, but it was already almost complete anyway.
After stopping the migration, we manually restarted the pods, and everything seemed to return to normal … but only for a few seconds. We then noticed that the services were not crashing on their own: something seemed to be triggering the crashes. Interesting!
In the meantime, an incident was declared, and a crisis cell was opened. As it was related to our team, and even though I wasn't directly following the migration, I still joined the war room out of curiosity, because I know that we often come out of these kinds of incidents with a wealth of experience and valuable lessons.
During the investigation, someone noticed that the crashing pods all had something in common: the same log line appeared just before they died. By tracing the origin of this log, we realized it came from an endpoint that (you see where this is going) I had developed! That's when I realized I was in for a rough time.
This endpoint seemed innocent and simple enough. It simply made a request to the database, formatted the retrieved data to match the DTO, logged some interesting info, and then returned it to the BFF (Backend for Frontend), which applied some changes, kept only the data it needed, and passed it on to the web client. Nothing unusual.
To better understand what had happened, and since I knew the endpoint well, I decided to take over and share my screen so that everyone could follow my actions. I mainly wanted to share a finding that seemed odd to me: when I called the endpoint locally via Postman and profiled the memory, I didn't notice anything abnormal. I tried the same request several times on the staging (pre-production) environment, and everything was still fine.
To get an overview of the faulty endpoint:
Note that this code is not the real one. It’s to illustrate the issue while being close to the situation that happened.
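Here is a minimal sketch of the handler, assuming an Express-style route and a hypothetical Prisma `team` model (the real code differed in the details):

```typescript
// Illustrative only: an Express route querying a hypothetical `team` model.
import express from "express";
import { PrismaClient } from "@prisma/client";

const app = express();
const prisma = new PrismaClient();
app.use(express.json());

// Returns the teams matching the companies and departments passed in the body.
app.post("/teams", async (req, res) => {
  const { companyId, departmentId } = req.body;

  const teams = await prisma.team.findMany({
    where: {
      companyId: { in: companyId },
      departmentId: { in: departmentId },
    },
  });

  // Log a bit of info and map the rows to the DTO returned to the BFF.
  console.log(`POST /teams: returning ${teams.length} rows`);
  res.json(teams.map((team) => ({ id: team.id, name: team.name })));
});
```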
Calling the endpoint [POST] /teams with a body like this:
{"companyId": ["id-1", "id-2"], "departmentId": ["dep-1", "dep-2"]}
would return the expected result. And the requests in production did indeed have the body we “expected”:
{"companyIds": ["id-1", "id-2"], "departmentIds": ["dep-1", "dep-2"]}
Just from this, the error was already apparent, but I hadn't noticed it yet (and maybe neither had you).
While continuing to demonstrate that everything should work fine, I mistakenly made a request with an empty body. The request still worked locally, but it returned far more data than expected. Together, we realized it was retrieving the entire content of the database table! 😱
But that still didn't explain what was happening, because the requests in production did have a body. That's when I noticed that in my test that worked locally, I had `companyId`, whereas in production, I had `companyIds`. Yes, I broke a service because I had put an “S” in the request body fields on the frontend but not on the backend. 😢
But then, how did we get to this point?
To summarize the situation: an endpoint had been developed two months before the migration script was launched. This endpoint was responsible for returning the teams of a given company. It initially consumed an empty database table, because the feature was still under construction. When the migration script ran, that table was populated, and requests started hitting real data. The backend expected to receive filters in the request body, but it wasn't receiving them (or rather, it was receiving differently named fields), and so it was building an incorrect query.
You might say, "Yes, but this request is incorrect and should therefore return an empty list or maybe fail because the filter is incorrect, right?"
That's what I thought too, until I read this from the Prisma documentation: “If undefined is used as the filter value, the query will act as if no filter criteria was passed to that column at all.”
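Concretely, with the body sent in production, the `where` clause built by the endpoint behaved as if there were no filter at all. A small fragment to illustrate, reusing the same hypothetical sketch as above:

```typescript
// The backend destructured the singular field names, but the frontend sent
// `companyIds` / `departmentIds`, so both variables are undefined here.
const { companyId, departmentId } = req.body;

const teams = await prisma.team.findMany({
  where: {
    companyId: { in: companyId },       // { in: undefined } → criteria ignored
    departmentId: { in: departmentId }, // { in: undefined } → criteria ignored
  },
});
// Effectively the same as prisma.team.findMany(): the whole table is loaded
// into memory, which is what ended up OOM-killing the pods.
```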
That's how I almost ruined the launch of a major feature of the platform with a simple forgotten “S”.
I learned many things from this experience:
The request should have returned a 400 error (Bad Request), because the parameter the backend expected had not been provided. There was no validation of incoming DTOs on this endpoint, only on outgoing data.
Tests had been written for this endpoint, but they were mainly unit/integration tests with jest and supertest, calling the findTeams function directly and passing it values. And since the parameters were not named (not passed as a destructured object), everything worked fine. We had no end-to-end tests on this endpoint.
Prisma (the ORM we used to query the database) is excellent for many things, but you should never trust it 100%, nor any other ORM for that matter. Look at the query logs before deploying your services.
This endpoint had not been properly tested under real conditions, even manually.
The parameters had no default value. Having a null value would have been better.
Do not neglect logs and alerts. They can help you notice an issue before your users do and investigate quickly.
Validating your request params and body is a must-have; it can save you from issues like this one (see the sketch below).
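For instance, building on the Express sketch above, schema validation at the edge of the endpoint would have rejected the misspelled body with a 400 instead of silently querying the whole table. A minimal sketch with zod (hypothetical; class-validator or any other validation layer would do the job just as well):

```typescript
import { z } from "zod";

// Hypothetical schema for the /teams body: both filters are required,
// so a body containing `companyIds`/`departmentIds` instead fails validation.
const findTeamsBody = z.object({
  companyId: z.array(z.string()).min(1),
  departmentId: z.array(z.string()).min(1),
});

app.post("/teams", async (req, res) => {
  const parsed = findTeamsBody.safeParse(req.body);
  if (!parsed.success) {
    // This is the 400 the production requests should have received.
    return res.status(400).json({ errors: parsed.error.issues });
  }

  const { companyId, departmentId } = parsed.data; // guaranteed non-empty arrays
  const teams = await prisma.team.findMany({
    where: { companyId: { in: companyId }, departmentId: { in: departmentId } },
  });
  res.json(teams.map((team) => ({ id: team.id, name: team.name })));
});
```

With a guard like that in place, the `{ in: undefined }` case can never reach Prisma.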
In the end, the problem was quickly resolved, and we were able to finalize the migration and launch the feature, one that I was very proud to have contributed to, positively (and negatively).