Monitoring Microservice Applications: An SRE's Perspective
Today, infrastructure is made up of many small apps that run under the control of a single app manager, which handles the number of instances, their updates, and resource requests. This setup is not the result of administrators trying to make infrastructure management easier; it reflects the current thinking paradigm in software development. To understand why we talk about microservice architecture as an ideology, let's go back 30 years.
At the end of the 80s and beginning of the 90s, as PCs grew in popularity, object-oriented programming became the answer to the growing size and complexity of software. Before that, software consisted mostly of small utilities, but large software projects were becoming common, and software development turned into big business. Teams of thousands of people churning out new functionality were nothing extraordinary.
Businesses had to figure out how to organize teamwork without creating a disastrous mess, and object-oriented programming was the answer.
However, the release cycle was still slow.
A company would plan the release of its product (say, Microsoft Office 95) a few years in advance. Once development was complete, the software had to be tested thoroughly, since fixing bugs after end users had installed the product would be difficult. Then the binary code was sent to a factory, which made the required number of copies on CDs or floppy disks. These were packaged in cardboard boxes and delivered to stores all over the world, where users would buy them and install them on their PCs. This is the main difference from what we have now.
Starting around 2010, as fast updates became a requirement in large software projects and companies, microservice architecture emerged as the answer to that challenge. We no longer need to install applications on users' computers; instead, we essentially "install" them in our own infrastructure, which lets us deliver updates quickly. Being able to update software as fast as possible is what enables us to experiment and test hypotheses.
Businesses need to create new functionality in order to retain and attract customers. They also need to experiment and figure out what makes customers pay more. Finally, businesses need to avoid lagging behind their competitors. So, a business might want to update its codebase dozens of times a day, and in theory you could do it even for one large application.
But if you split it into smaller pieces, managing updates becomes easier. That is to say, switching to microservices wasn't about making applications and infrastructure more stable: microservices play an important role in Agile development, and agile software is what businesses strive for.
What does agile mean? It means speed, ease of implementing changes, and the option to change your mind. What matters is not a solid product, but rather the speed of delivering it and the ability to try out concepts quickly. In theory, after trying them out, companies would then allocate resources to build a solid product based on those concepts. In practice, that doesn't happen often, especially in small teams and growing businesses where the main goal is to keep developing the product. The result is technical debt, which can be exacerbated by the belief that "we can just leave it to Kubernetes."
But that's a dangerous attitude. Recently, I stumbled upon a great quote that illustrates both the advantages and the horrors of running Kubernetes in a production environment:
"Kubernetes is so awesome that one of our JVM containers has been periodically running out of memory for more than a year, and we just recently realized about it."
Let's consider this carefully. For a whole year, an application was crashing because it ran out of memory, and the operations team didn't even notice. Does that mean the application was mostly stable and working as intended? At first glance, this functionality is very useful: instead of sending an alert about a service crash so that an administrator can go fix it manually, Kubernetes detects a crashed app and restarts it on its own. This happened regularly throughout the year, and administrators never received a single alert. I've also seen a project where a similar situation occurred, and the team only found out about it when generating a monthly report.
The reporting functionality had been developed and deployed to production to help business users, but soon they started getting HTTP 502 errors in response to their requests: the app was crashing, the request wasn't processed properly, and Kubernetes would then restart the app. While the application was technically running, it was impossible to generate reports. The employees who were supposed to use the service preferred to create reports the old-fashioned way and didn't report the error (after all, the company needed those reports only once a month, so why bother anyone?), and the operations team saw no need to prioritize a problem that occurred, at most, once a month. As a result, all the resources spent on creating that functionality (business analysis, planning, and development) were wasted, and this only became obvious a year later.
Our past experiences have helped us establish a set of practices aimed at minimizing the risks of maintaining microservice applications. In this article, I'll share ten of them, the ones I consider most important, along with the contexts in which they apply.
When service reboots are not monitored/not taken seriously
Example
See above. At a minimum, the problem is a user not getting the data they need; at worst, it's a function that fails systemically.
What you can do
Basic monitoring: watch whether your services reboot at all. A service that reboots once every three months probably doesn't deserve high priority, but if a service starts rebooting every five minutes, take note.
Extended monitoring: keep an eye on every service that has rebooted even once, and set up a process for creating tasks to analyze those reboots.
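To make this concrete, here's a minimal sketch in Python that asks Prometheus which containers have restarted over the last hour. It assumes kube-state-metrics is running (it exports the kube_pod_container_status_restarts_total metric) and that Prometheus is reachable at the hypothetical PROM_URL; in production, you'd express the same query as an alerting rule rather than a script.

```python
# A minimal sketch: list pods whose containers restarted in the last hour.
# Assumes kube-state-metrics is installed and PROM_URL points to Prometheus.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumption: adjust to your setup
QUERY = 'increase(kube_pod_container_status_restarts_total[1h]) > 0'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    restarts = float(result["value"][1])
    print(f"{labels.get('namespace')}/{labels.get('pod')} "
          f"({labels.get('container')}): {restarts:.0f} restart(s) in the last hour")
```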
When service errors, like fatal errors or exceptions, are not monitored
Example
An app doesn't crash but instead displays a stack trace of an exception to users (or sends it to another app via API). In this case, even if we monitor app reboots, we might miss situations where requests are processed incorrectly.
What you can do
Aggregate app logs in a suitable tool and analyze them. Look through the errors thoroughly, and if you find a critical one, configure an alert for it and escalate the investigation.
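As an illustration, here's a minimal sketch of the application side of this practice: rather than letting a stack trace leak to users, the handler logs it for the aggregator and increments a Prometheus counter you can alert on. The handle() wrapper and its wiring are hypothetical; only the prometheus_client calls are real library API.

```python
# A minimal sketch: log the exception for the aggregator and count it in a
# Prometheus counter an alert can watch
# (e.g. rate(app_unhandled_errors_total[5m]) > 0).
import logging
from prometheus_client import Counter

logger = logging.getLogger("app")
ERRORS = Counter("app_unhandled_errors_total",
                 "Unhandled exceptions", ["endpoint"])

def handle(endpoint, func, *args, **kwargs):
    try:
        return func(*args, **kwargs)
    except Exception:
        ERRORS.labels(endpoint=endpoint).inc()
        logger.exception("unhandled error in %s", endpoint)  # full trace goes to the logs
        return {"error": "internal error"}, 500  # the user gets a clean error, not a trace
```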
When there’s no health-check endpoint or it doesn't do anything useful
Example
Thankfully, creating endpoints that return service metrics (ideally in the OpenMetrics format) so that they can be collected (for example, by Prometheus) is practically a standard nowadays. However, with businesses pressuring developers for new functionality, developers often don't want to spend time designing metrics. As a result, quite often the only thing a service health check does is return "OK": if the app can produce any output at all, it is considered "OK." But that's not how it should be. Such a health check will still return "OK" even if the app can't connect to its database server, and that false information will hinder the investigation of an issue.
What you can do
First of all, having a health-check endpoint for every service should become the norm in your company, if it isn't already. Secondly, a health check should verify the health and availability of every system critical to the functioning of the service: access to queues, databases, the availability of other services, and so on.
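For instance, here's a minimal sketch (using Flask for brevity) of a health check that actually verifies its critical dependencies instead of always returning "OK". The database and queue connection details are assumptions for illustration.

```python
# A minimal sketch of a health check that verifies critical dependencies.
# Connection parameters are placeholders; replace them with your own.
import psycopg2
import redis
from flask import Flask, jsonify

app = Flask(__name__)

def check_db():
    # Short timeout so a dead database fails the check quickly.
    conn = psycopg2.connect(host="db", dbname="app", user="app",
                            password="secret", connect_timeout=2)
    conn.cursor().execute("SELECT 1")
    conn.close()

def check_queue():
    redis.Redis(host="queue", socket_connect_timeout=2).ping()

@app.route("/healthz")
def healthz():
    checks = {"database": check_db, "queue": check_queue}
    status, healthy = {}, True
    for name, check in checks.items():
        try:
            check()
            status[name] = "ok"
        except Exception as exc:
            status[name] = f"failed: {exc}"
            healthy = False
    return jsonify(status), 200 if healthy else 503
```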
When API response time and service interaction time are not monitored
Example
These days, when most parts of an application have turned into clients and servers interacting with each other, you have to know how quickly each service responds. If response time grows, one lag leads to another, and thanks to the domino effect, the overall response time of the app increases accordingly.
What you can do
Use tracing. Jaeger is pretty much the standard now, and there's a great team working on OpenTracing (much as OpenMetrics is being developed). The OpenTracing project provides client APIs for most popular programming languages, which can produce metrics on app response time and service interaction time so that you can add them to Prometheus.
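As a starting point, here's a minimal sketch of instrumenting a Python service with the OpenTracing API via the Jaeger client (the jaeger-client package). The service and span names are illustrative.

```python
# A minimal sketch: OpenTracing instrumentation with the Jaeger Python client.
from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},  # sample every trace (fine for a demo)
        "logging": True,
    },
    service_name="report-service",
    validate=True,
)
tracer = config.initialize_tracer()

def generate_report():
    # Each nested span shows up in Jaeger with its own timing, so you can
    # see exactly which step of the interaction got slower.
    with tracer.start_active_span("generate_report") as scope:
        scope.span.set_tag("report.kind", "monthly")
        with tracer.start_active_span("fetch_rows"):
            pass  # query the database here
        with tracer.start_active_span("render_pdf"):
            pass  # render the output here
```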
A service means an app, and an app means consumption of memory and CPU (and sometimes disk) resources
Example & What you can do
This might seem obvious, yet many companies don't monitor the resource consumption of the services themselves: how much CPU, RAM, and (where measurable) disk resources each one uses. In general, you should collect all the standard metrics you'd use when monitoring a server. So, besides monitoring the node as a whole, we must also monitor each individual service.
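If you already scrape cAdvisor metrics (the kubelet exposes them by default), a sketch like the following can pull per-service CPU and memory usage out of Prometheus; the Prometheus address is an assumption, as before.

```python
# A minimal sketch: per-pod CPU and memory usage from cAdvisor metrics.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumption: adjust to your setup

QUERIES = {
    "cpu_cores":    'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))',
    "memory_bytes": 'sum by (namespace, pod) (container_memory_working_set_bytes)',
}

for name, query in QUERIES.items():
    data = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
    print(f"--- {name} ---")
    for r in data["data"]["result"]:
        m = r["metric"]
        print(f'{m.get("namespace")}/{m.get("pod")}: {float(r["value"][1]):.2f}')
```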
Monitoring for new services
Example
This might sound odd, but it's worth mentioning. When there are many development teams and even more services, and the SREs are focused on overseeing development, the operations team responsible for a specific cluster should watch for new services appearing in that cluster and receive notifications about them. You might have standards that define how a new service should be monitored, how its performance is measured, and which metrics it should export, but when a new service appears, you still have to verify compliance with those standards.
What you can do
Set up notifications for new services in your infrastructure.
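One possible approach, sketched below with the official kubernetes Python client: compare the services currently in the cluster against a saved list and notify about anything new. The notify() function and the state file are hypothetical placeholders.

```python
# A minimal sketch: detect services that appeared since the last run.
import json
from pathlib import Path
from kubernetes import client, config

STATE = Path("known_services.json")  # hypothetical local state file

def notify(name):
    print(f"New service detected: {name}")  # hook up Slack/email/etc. here

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

current = {f"{s.metadata.namespace}/{s.metadata.name}"
           for s in v1.list_service_for_all_namespaces().items}
known = set(json.loads(STATE.read_text())) if STATE.exists() else set()

for svc in sorted(current - known):
    notify(svc)

STATE.write_text(json.dumps(sorted(current)))
```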
Monitoring delivery time and other CI/CD metrics
Example
This is another relatively recent issue.
The performance of an application is influenced by how quickly it can be deployed. Complex CI/CD processes, combined with increasingly complicated app builds and the process of building a container for delivery, make seemingly simple deployments not so simple (here's our article on that topic).
One day, you might find that deploying a certain service takes 20 minutes instead of one.
What you can do
Monitor how long it takes to deliver each of your apps, from the start of the build to the moment the app is running in production. If delivery time starts to increase, look into it.
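For example, a CI pipeline could record its delivery duration and push it to a Prometheus Pushgateway, as in this minimal sketch; the Pushgateway address and the metric and job names are assumptions.

```python
# A minimal sketch: push a pipeline's delivery duration to a Pushgateway
# so you can graph and alert on it in Prometheus.
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge("ci_delivery_duration_seconds",
                 "Time from build start to running in production",
                 ["app"], registry=registry)

start = time.time()
# ... build the image, push it, wait for the rollout to finish ...
duration.labels(app="report-service").set(time.time() - start)

push_to_gateway("pushgateway.monitoring.svc:9091",
                job="ci_delivery", registry=registry)
```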
Application performance monitoring and profiling
Example & What you can do
When you learn that there's an issue with one of your services (say, the response time is too long or the service is unavailable), you won't enjoy diving blindly into the service and restarting it in an attempt to pinpoint the issue. In our experience, tracking down an issue is easy if you have detailed data from APM. Issues rarely appear out of the blue; they're usually the result of minor glitches piling up, and APM can help you understand when it all started. Another thing you can do is learn to use system-level profilers; thanks to the development of eBPF, there are now many options for that.
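Even without a full APM product, you can start with its simplest building block: a per-endpoint latency histogram, as in this minimal prometheus_client sketch. With that data in Prometheus, you can see when response times began to creep up long before anyone files a ticket. The timed() helper is hypothetical glue; Histogram.time() is real library API.

```python
# A minimal sketch: record per-endpoint request latency in a histogram.
from prometheus_client import Histogram

LATENCY = Histogram("app_request_duration_seconds",
                    "Request latency", ["endpoint"])

def timed(endpoint, func, *args, **kwargs):
    # Records the call's duration in the histogram, whatever the outcome.
    with LATENCY.labels(endpoint=endpoint).time():
        return func(*args, **kwargs)
```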
Monitoring security: WAF, Shodan, images and packages
Example
Monitoring shouldn't be restricted to performance. It can also help ensure the security of your service:
- Monitor the results of running "npm audit" (or its equivalents) as part of your app's build process: you'll get alerts if there are issues with the library versions you use, so you can update them.
- Using the Shodan API (Shodan finds open ports and databases exposed to the Internet), check your IP addresses to make sure you don't have unexpected open ports and that your databases haven't leaked. (A minimal sketch of such a check follows this list.)
- If you use a WAF, set up alerts for WAF events so that you can see deliberate intrusion attempts and the attack vectors used by intruders.
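Here's the promised sketch of the Shodan check, using the official shodan Python library. The API key, the IP address, and the expected_ports inventory are assumptions; replace them with your own.

```python
# A minimal sketch: compare Shodan's view of your hosts with the ports
# you actually expect to be open.
import os
import shodan

api = shodan.Shodan(os.environ["SHODAN_API_KEY"])
expected_ports = {"203.0.113.10": {80, 443}}  # documentation-range IP as an example

for ip, allowed in expected_ports.items():
    try:
        host = api.host(ip)
    except shodan.APIError as exc:
        print(f"{ip}: Shodan lookup failed ({exc})")
        continue
    unexpected = set(host.get("ports", [])) - allowed
    if unexpected:
        print(f"ALERT: {ip} has unexpected open ports: {sorted(unexpected)}")
```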
A bonus tip: SREs, keep in mind that your app's response time doesn't equal your server's response time!
We're used to measuring a system's performance by its server's response time, but 80% of a modern app's logic now lives in the frontend. If you aren't already measuring your app's response time as the time it takes to render a page, along with frontend page-load metrics, you should start. Users don't care whether your server responds in 200 or 400 milliseconds if your Angular- or React-based frontend takes 10 seconds to load the page. In general, I believe that performance optimization in the future will focus on the frontend, or even emerge as a field of its own.