Historically, any issue can split the tech industry into two camps: "for" and "against." Moreover, the matter of dispute can be completely arbitrary. Which OS is better, Win or Linux? Which is best for smartphones, Android or iOS? Should you store everything in the cloud or keep it in cold RAID storage? Can PHP experts call themselves developers? Most of the time, these disputes are purely philosophical in nature and have no empirical basis, giving rise to a lot of hair-splitting.
It’s not surprising that with the advent of containers, Docker, or K8s and the like, there have been lots of arguments for and against using new inventions in various areas of the backend. (Note that in this article Kubernetes serves as a generalized orchestrator. You can substitute it with the name of any other orchestrator that you find the most comfortable and familiar.)
It could have been another simple dispute about two sides of the same coin. For example, it could have been the same senseless and merciless confrontation as Win vs Linux, in which most normal people stand somewhere in the middle. Unfortunately, it's not that simple. In the dispute over whether or not to store databases (DB) in container systems, there is no right side, because, in a certain sense, both supporters and opponents of this approach are right.
Let's look at the pros. Say you have a major web project. It might initially be based on a microservices approach, or it might at some point turn out that way — it's not very important, really. You spread your project over several microservices and set up orchestration, traffic balancing, and scaling, and then you think it's time to sip mojitos in a hammock instead of having to recover failed servers. Not so fast! Very often, the application code is the only thing that’s containerized. But what else is there besides the code?
Bingo! It's data. Any project's core is its data. It can live in a typical DBMS (MySQL, PostgreSQL, MongoDB), in a storage used for search (Elasticsearch), in a key-value store (Redis), and so on. Now, let's set aside poorly implemented backends where queries can crash the DB, and instead talk about DB fault tolerance under client traffic. After all, when we containerize our app and let it scale freely to process any number of incoming requests, the load on our DB increases.
In fact, the request channel to our DB and the server where our DB lives are the eye of a needle that leads into our beautiful containerized backend. And don't forget that the main point of containerization is to make the structure mobile and flexible, which, in turn, allows you to balance load across the available infrastructure as efficiently as possible. So, if we don't containerize all the elements of the system, including the DB, across the cluster, we're making a very serious mistake.
It makes more sense to clusterize not only the app itself, but also the services responsible for data storage. If we, say, prepare our web servers for clustering by spreading them across different tables and databases within one monolithic DBMS, we immediately solve the problem of data synchronization (for example, comments on posts). Either way, we obtain an intra-cluster (albeit virtual) DB view, such as an ExternalName service. The DB itself, however, is not yet in the cluster: the web servers we deployed in K8s still pull data from our static production database, which operates separately.
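As a sketch of what such an intra-cluster DB view can look like: in Kubernetes, an external database can be presented inside the cluster as an ExternalName Service, so pods resolve it by a cluster-local name while the DBMS itself keeps running outside. The service name and hostname below are hypothetical.

```yaml
# Hypothetical example: present an external production DBMS to the cluster
# as a virtual service. Pods connect to the in-cluster name "prod-db";
# Kubernetes DNS answers with a CNAME pointing at the external host.
apiVersion: v1
kind: Service
metadata:
  name: prod-db
spec:
  type: ExternalName
  externalName: db.prod.example.com  # the DB server outside the cluster
```

From the application's point of view, the DB now looks like any other in-cluster service, even though it hasn't moved anywhere.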
See the catch-22? We use K8s or Swarm to balance the load and keep our primary web servers from failing, yet we don't do the same for the DB. What use are empty web pages that return access errors because the database has failed?
That is precisely why we need to clusterize not only web servers, but also DB infrastructure. It's the only way we can create a structure whose elements can work in concert, yet independently from one another. Even if half of our backend fails due to high traffic, the rest will survive. Plus, a DB synchronization system within the cluster and the opportunity to scale and deploy new clusters without limit will help to achieve the required capacities. The only limit is the number of racks in the data center.
In addition, a clusterized DB is portable. In the case of a global service, it's quite illogical to locate your web cluster somewhere in San Francisco and move data packages back and forth to New York for every DB request.
Also, DB clustering allows you to build all system elements at one level of abstraction. That, in turn, makes it possible for devs to control the system directly with code, without the active involvement of the ops team. Want to create a separate DBMS for a new sub-project? Piece of cake! Write a YAML file, load it into the cluster, and voila!
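A minimal sketch of that "YAML file," assuming a single-replica PostgreSQL instance for the new sub-project (the names, image tag, and storage size are illustrative; a real setup would add resource limits, backups, and proper secret management):

```yaml
# Hypothetical sketch: a new per-sub-project DBMS declared in one file.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: subproject-db
spec:
  serviceName: subproject-db
  replicas: 1
  selector:
    matchLabels:
      app: subproject-db
  template:
    metadata:
      labels:
        app: subproject-db
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:          # assumes a pre-created Secret
                  name: subproject-db-secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:                # persistent storage per replica
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Loading it really is a one-liner: `kubectl apply -f subproject-db.yaml`.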
And, of course, the internal administration becomes significantly easier. How many times have you winced when new colleagues thrust their greasy fingers into your production DB? The one and only production database! Of course, we're all adults here, and you probably have several backups here and there and maybe in cold storage, because you’ve seen a DB apocalypse before. But still, every new team member with access to the production infrastructure and database is a bucket of valium for all team leaders. It's scary, right?
Containerization and geographical distribution of your project DB helps to avoid such terrifying moments. Newcomers are not trustworthy? Okay! Let's give them a separate cluster to work on, unplug it from the rest of the DB, and sync clusters only by a manual push and a simultaneous turn of two keys (one held by the team lead, the other by the system administrator). Everyone's happy!
Now, let's play devil’s advocate and reveal all the disadvantages of DB containerization.
To discuss why we should not containerize a DB and should continue to run it on central server replicas, let's not sink into the standard "that's the way it's always been done" arguments. Instead, let's think about when containerization actually brings tangible benefits.
Realistically speaking, the number of projects that really need to containerize the DB can be counted on one hand.
In most cases, the very use of K8s or Docker Swarm tends to be redundant. Quite often, these tools are adopted because of the widespread hype around clouds and containers. Most people simply think it's cool.
Again, using K8s or Docker for a project is usually above and beyond what's needed. Sometimes DevOps teams or outsourced specialists don't pay attention to that fact. Sometimes — and this is way worse — DevOps teams are compelled to use containers.
Many people think that the Docker/K8s clique is simply moving in on DevOps teams that prefer to outsource the resolution of infrastructure issues. In fact, working with clusters requires engineers who understand the architecture of the implemented solution and know how to operate it. At DevOpsProdigy, we once taught a client — mass media platform Republic — to work with K8s. They were happy, we were happy. It was honest. Most often, however, K8s promoters take the client's infrastructure hostage. While they know the ins and outs of the container system, the client's team doesn't know beans about it.
Now, let's imagine that the outsourced DevOps engineer receives access not only to the web servers, but also to DB maintenance. Remember, the DB is the core of any project, and losing it would be fatal. The prospects are far from positive. So, instead of giving in to the K8s hype, most teams would be better off with a good AWS package, which will solve all the load-balancing problems of their site or project. Here, I expect somebody to respond that AWS is no longer cool enough... Well, there are show-offs everywhere, including the tech industry.
Perhaps clustering is indeed necessary for some projects. While stateless applications raise no concerns in that case, clustering the DB, and then organizing decent network connectivity for it, raises a lot of questions.
Even a seemingly seamless engineering solution like K8s still causes headaches, such as data replication in a clusterized DB. Some DBMSs are natively friendly to distributing data among their separate instances; many others are not. So, quite often, the ability to replicate with minimal resource and engineering costs was not the main argument when the DBMS for our project was chosen, especially if the project was not originally planned as a microservice architecture but simply turned out that way.
And, speaking of virtual file systems: unfortunately, we can't call Docker Volumes problem-free. In general, reliable long-term data storage requires the simplest technical schemes. Adding a new abstraction layer from the container file system into the host file system is risky enough as it is. When, however, there are also problems with transmitting data between these layers, it's big trouble indeed. The more complex the process is, the easier it gets destroyed.
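To make the layering concrete, here is a hypothetical docker-compose fragment (service names and paths are illustrative). A named volume goes through Docker's storage backend, while a bind mount maps a host directory directly, giving one fewer abstraction layer between the DBMS files and the disk:

```yaml
# Hypothetical sketch: two ways to persist MySQL data, with a different
# number of layers between the container FS and the host FS.
services:
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: change-me  # illustrative only
    volumes:
      # Named volume: managed by Docker's storage backend, an extra
      # abstraction layer on top of the host file system.
      - db-data:/var/lib/mysql
      # Bind mount alternative: a direct host path, the simpler scheme.
      # - /srv/mysql-data:/var/lib/mysql
volumes:
  db-data:
```

The simpler the path from the DBMS to the disk, the fewer things there are to break.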
Considering all these issues, it's often easier and more beneficial to keep the DB in one place, even if you containerize your app. Let the app scale on its own while connecting to the DB, which is read and written in one place, via a distribution gateway. Such an approach reduces the risk of mistakes and desynchronization to naught.
To sum up, DB containerization is appropriate only where there is a real need for it.
If you’re looking for a black-and-white conclusion about whether to virtualize your DB or not, I regret to say there is none. When creating any infrastructure solution, you should follow common sense, not hype or tech innovations.
There are projects that perfectly incorporate K8s principles and tools; such projects find harmony at least in the backend part. There are also projects that need normal server infrastructure, not containerization. The reason is that they can't re-scale to the microservice cluster model — they will simply fail if they do so.