<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[DevOpsProdigy Blog]]></title><description><![CDATA[DevOpsProdigy Blog]]></description><link>http://devopsprodigy.com/blog/</link><image><url>http://devopsprodigy.com/blog/favicon.png</url><title>DevOpsProdigy Blog</title><link>http://devopsprodigy.com/blog/</link></image><generator>Ghost 3.12</generator><lastBuildDate>Wed, 07 May 2025 13:00:25 GMT</lastBuildDate><atom:link href="http://devopsprodigy.com/blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The ETL process development cycle]]></title><description><![CDATA[To see and analyze your data, you have to fetch it from its source, process it, and put it in some data storage system first. This process is known as ETL — extract, transform, load.]]></description><link>http://devopsprodigy.com/blog/the-etl-process-development-cycle/</link><guid isPermaLink="false">606416fcc1c49900013d1be8</guid><category><![CDATA[analytics]]></category><category><![CDATA[AI]]></category><category><![CDATA[ETL]]></category><category><![CDATA[CI/CD]]></category><category><![CDATA[Apache]]></category><category><![CDATA[Greenplum]]></category><category><![CDATA[Jenkins]]></category><category><![CDATA[GitLab]]></category><dc:creator><![CDATA[Ivan Khozyainov]]></dc:creator><pubDate>Wed, 31 Mar 2021 07:29:54 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2021/03/22.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2021/03/22.jpg" alt="The ETL process development cycle"><p>Any modern technology-based business generates an enormous amount of data on a daily basis. This data might be clearly visible to decision makers or stay hidden from them. 
In any case, analyzing this data helps you see what's going on and make better decisions. Existing data analysis tools include business analytics platforms, machine learning tools, and AI-powered analytics tools.</p><p>To see and analyze your data, you have to fetch it from its source, process it, and put it in some data storage system first. This process is known as ETL — extract, transform, load.</p><p>Companies usually perform ETL using specialized software that allows for scaling in order to accommodate a growing volume of data. Processing large datasets requires a cluster that can run ETL processes in parallel, with workloads adjusted to network, disk space, and CPU capacities.</p><p>In this case, creating an ETL process from scratch isn't reasonable. It's easier to take a framework with embedded scalability, then supplement it by writing and debugging your own code that carries out data processing logic tailored to your requirements.</p><h3 id="why-we-recommend-building-etl-processes-with-ci-cd">Why we recommend building ETL processes with CI/CD</h3><p>ETL incorporates several very different components: the data source, the method of data transportation, the processing logic, and the data storage.</p><p>In order to verify that the processing logic is working correctly, it is strongly recommended to test the data processing component together with the whole component chain (data extraction, transportation, processing, and storage) as opposed to testing it alone. In this case, it's very important that developers don't forget about any intermediate tests when checking their code for errors — otherwise, poor-quality code will make it to the production environment and cause problems that might be very difficult to fix.</p><p>That's why it's better to automate this process as much as possible so that the developers don't have to check all code changes manually.
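</p><p>To make the extract-transform-load cycle described above concrete, here is a minimal, purely illustrative sketch in Python. The in-memory source records and the SQLite target stand in for a real data source and a real warehouse, and every name in it is hypothetical:</p>

```python
import sqlite3

# Hypothetical source records, standing in for a real data source
# (an API, a CRM export, IoT sensor readings, and so on).
RAW_RECORDS = [
    {"device_id": "a1", "temp_c": "21.5"},
    {"device_id": "b2", "temp_c": "19.0"},
    {"device_id": "a1", "temp_c": "bad"},  # malformed reading to be filtered out
]

def extract():
    """Fetch raw records from the source."""
    return RAW_RECORDS

def transform(records):
    """Validate and normalize: drop unparsable readings, convert types."""
    clean = []
    for r in records:
        try:
            clean.append((r["device_id"], float(r["temp_c"])))
        except (KeyError, ValueError):
            continue  # a real pipeline would log or quarantine these
    return clean

def load(rows, conn):
    """Write normalized rows into the target store (SQLite here)."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 2
```

<p>In a production setup, each of these three steps is handled by a scalable component rather than a local function; that is exactly why a framework with embedded scalability is preferable to writing the whole chain from scratch.</p><p>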
Without automation, overlooking errors is simply a matter of time.</p><p>Besides that, developers also need to have access to an environment with all the chain components that they might need to refer to when fixing errors. Given the diverse nature of the chain components and their scaling capabilities, it's difficult or outright impossible to deploy such an environment on a developer's machine.</p><p>CI/CD is a pipeline with consecutive phases that include building, testing, and deploying code and, later, code changes, in a production environment. CI/CD consists of two stages: continuous integration (CI) and continuous delivery (CD). At the CI stage, developers work on code, then it’s tested and prepared for the next stage. At the CD stage, the prepared code changes are deployed in the target environment.</p><p>Each step within this pipeline starts after the previous step has been successfully completed, and can be performed with various tools.</p><p>With a CI/CD pipeline, you can automate the steps that would benefit from it, and after these steps are completed, the pipeline deploys code changes in the testing environment. After code changes have been approved, they can be deployed to the production environment. This pipeline concept ensures that the requirements for the development of an ETL process will be fulfilled.</p><h3 id="tasks-that-require-etl">Tasks that require ETL</h3><p>The success of a business that has deeply integrated IT into its structure depends on the efficiency of its data collection and data processing methods. Automation, monitoring, and prompt response are valuable in many industries — for example, manufacturing, service, and maintenance of smart infrastructures.</p><p>One of the tasks that make companies build ETL processes is data collection from IoT sensors. Collected data often has to be prepared in some way before it can be loaded into a storage system. 
This step is necessary because of the large volumes of collected data, the necessity to improve data quality in order to use it for machine learning, and the variety of data formats and the ways different IoT devices represent data. In the case of IoT, improving the quality of collected data and processing it are especially important.</p><p>Another good example of a task that requires ETL is collecting geolocation data from public transport vehicles with respective sensors (this might be a part of implementing a smart city project). In this case, analyzing geolocation data alone isn't of any use — it must be analyzed in the context of timetables, routes, employee work schedules, and other information that geolocation sensors can't provide. An ETL process lets you enrich collected data with additional information necessary for proper analysis.</p><p>Besides these examples, tasks that require an ETL process might arise in any type of business that relies on the results of data stream processing — and the overall number of such tasks keeps growing.</p><h3 id="transferring-processing-and-storing-data-kafka-spark-and-greenplum">Transferring, processing, and storing data: Kafka, Spark, and Greenplum</h3><p>There are many potential data sources, including information systems and environmental sensors. This means that data can be collected from various CRMs, EDM systems, financial services, and sensors in manufacturing facilities. Each data source can represent data in its own way and be geographically removed from the analytics server. This is why transferring and processing data is a necessary step.</p><p>Transferring data between different systems is a separate task that requires specialized solutions. One of the most popular existing solutions that lets you collect and transfer messages is <a href="https://kafka.apache.org/">Apache Kafka</a>. 
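</p><p>As an illustration, a producer that pushes sensor readings into Kafka might look like the sketch below. It uses the third-party kafka-python client; the broker address and topic name are hypothetical. The serialization logic is kept in a plain function so it can be unit-tested without a broker:</p>

```python
import json

def encode_reading(reading):
    """Serialize one reading to (key, value) bytes. Keying by device_id
    keeps each device's readings ordered within a single Kafka partition."""
    key = reading["device_id"].encode("utf-8")
    value = json.dumps(reading, sort_keys=True).encode("utf-8")
    return key, value

def send_readings(readings, bootstrap="localhost:9092", topic="sensor-readings"):
    """Publish readings to Kafka. Requires a running broker and the
    kafka-python package; both are assumptions of this sketch."""
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for r in readings:
        key, value = encode_reading(r)
        producer.send(topic, key=key, value=value)
    producer.flush()  # block until all buffered messages are delivered
```

<p>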
It’s scalable, allows for regulating the throughput capacity by adding new nodes, and prevents data from being lost in transit by replicating it. So, we suggest using Apache Kafka for transferring data.</p><p>Processing data and implementing ETL processes can be quite resource-intensive operations, especially if you deal with large volumes of data. That's why a good data processing system should be scalable. One of the popular solutions that fit that requirement is <a href="https://spark.apache.org/">Apache Spark</a>. You can easily integrate it with Apache Kafka or other data sources. Apache Spark also has modules for additional data processing tasks, for example, MLlib for machine learning and GraphX for processing graphs, which helps with preparing data for analytics. So, in addition to Apache Kafka for transferring data, we suggest using Apache Spark as a data processing framework.</p><p>After transferring data, you need to put it in a long-term storage system. For storage, we suggest using <a href="https://greenplum.org/">Greenplum</a>, which is a DBMS that can handle large volumes of data. This platform is based on PostgreSQL and allows for horizontal scaling of data storage servers. Thanks to its massively parallel processing architecture, Greenplum fits the requirements for working with machine learning, business intelligence, and other analytical tasks.</p><p>The three systems suggested by us — Kafka, Spark, and Greenplum — are open source, have served as a basis for some large software projects, and have general documentation and documentation for developers.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------1.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 1. 
The data flow through the components</figcaption></figure><h3 id="an-infrastructure-for-ci-cdimplementation">An infrastructure for CI/CD implementation</h3><p>This data processing pipeline includes separate nodes for data generation, transfer, processing, and storage. Development and debugging of the data processing node might involve the other nodes as well, and there must be at least two separate environments in order to ensure a reliable development process. These two environments — the production environment and the stage/development environment — should consist of the same components, which helps you minimize any potential side effects and improve code quality. Automating deployment and testing also requires an infrastructure: a code repository, a build system, and systems for testing and static code analysis.</p><p>For software development teams, GitLab, a repository management system based on the <a href="https://about.gitlab.com/">Git version control system</a>, is a well-established solution. It can be integrated with other CI/CD systems and also provides its own tools for implementing CI/CD. The wide range of accessible tools and modules lets you organize a convenient and flexible development process.</p><p>Organizing a process for building, testing, and deploying code is a separate task. There are several existing solutions for it, both proprietary and open source. A good choice is <a href="https://www.jenkins.io/">Jenkins</a>, a popular open-source system, which can be extended with additional modules and integrated with development and testing tools.</p><p>Errors in code can be avoided with automated source code checking — also known as static code analysis. It's recommended to implement it as a separate testing step in the build stage. Of all the static code analysis solutions out there, the <a href="https://www.sonarqube.org/">SonarQube open-source platform</a> seems to be the most suitable choice.
It reviews code automatically, and if it encounters a potential error in a line of your source code, it adds a comment there. It can be integrated with Jenkins as a module, so any code changes that, according to SonarQube, contain errors can be discarded automatically, which improves the overall code quality.</p><p>Testing is a very important stage for verifying whether the data processing logic is correct. Unit tests, prepared in advance, help you check whether the business logic of your system works properly. Besides unit testing, integration and load testing can be useful as well, because these tests show how your system works in conditions similar to those of the production environment. A good tool for generating load and performing integration testing is <a href="https://jmeter.apache.org/">JMeter</a>, an open-source system. It can be integrated with Jenkins as a module and is suitable for assessing the performance of an ETL task.</p><p>The systems that we mentioned have web interfaces for visualizing the respective resulting information about builds and localizing errors quickly. We'll describe these systems in more detail below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------2.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 2. Components involved in the development of an ETL task</figcaption></figure><h3 id="gitlab-for-managing-code-and-repositories">GitLab for managing code and repositories</h3><p>GitLab is an <a href="https://gitlab.com/gitlab-org/gitlab">open-source product</a>, which is currently actively developed and maintained. According to its development team, GitLab is used by over 100,000 organizations, which is quite believable.</p><p>Thanks to being open source and providing maintained security mechanisms, GitLab is a good fit for a highly secure internal environment. GitLab supports two-factor authentication and single sign-on. 
Another advantage of GitLab is that developers who already store their projects in public repositories on GitHub can easily transition to using GitLab because of their similar interfaces.</p><p>GitLab offers different tools to make software development convenient and efficient: there are tools for issue tracking, planning, discussions, collaboration, and visualizing repository branches, and also a source code editor. You can find the full list of <a href="https://about.gitlab.com/features/">GitLab's features</a> on the project's website. This platform also has a system of access privileges and repository user roles.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------3.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 3. GitLab UI</figcaption></figure><h3 id="jenkins-for-setting-up-a-ci-cd-pipeline">Jenkins for setting up a CI/CD pipeline</h3><p><a href="https://github.com/jenkinsci/jenkins">Jenkins</a> is an open-source software development automation system. Currently, this project is actively developed and maintained. Jenkins can be used in a highly secure internal environment thanks to being open source and providing security modules (similarly to GitLab).</p><p>You can use existing plugins to integrate Jenkins with other systems and also create new plugins to use in your own projects. There are over 1,500 plugins created by the Jenkins community, which extend the platform's functionality and improve it in different ways — this includes fine-tuning UI and usability, supplementing Jenkins with additional build systems and deployment tools, and so on.</p><p>Thanks to its modular structure, a build and delivery pipeline in Jenkins can use different systems required in the software development process.
The integration with GitLab provides automatic code change checking, which helps ETL task developers find potential errors quicker, and this lets them make ETL tasks more stable and effective.</p><p>We suggest storing a configured CI/CD pipeline as source code, because developers can create pipelines for similar ETL tasks faster by using the existing pipeline code as a basis. Each pipeline stage is visualized in the web interface, so developers get feedback on the performance of each stage and can quickly localize potential issues.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------4.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 4. Jenkins UI showing the build process</figcaption></figure><h3 id="sonarqube-for-static-code-analysis-unit-testing">SonarQube for static code analysis. Unit testing</h3><p>In ETL task development, ensuring code quality is very important — when processing large volumes of data, the cost of an error can be very high. Corrupt data or lost data packets can decrease the overall quality of stored data, which then can't be used in analytics and loses its value.</p><p>Before starting the development of an ETL task, we recommend creating unit tests that cover edge cases, so that you can ensure proper implementation of the business logic. These tests will help you define the boundary conditions of your ETL task and avoid errors in its development. Unit testing is included in the CI/CD pipeline as a separate stage.</p><p>To check source code automatically, we use the <a href="https://github.com/SonarSource/sonarqube">SonarQube open-source platform</a>. Like Jenkins and GitLab, SonarQube can be used in highly secure segments of a corporate network.</p><p>SonarQube integrates with Jenkins, so code can be built and then analyzed within a Jenkins pipeline. 
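</p><p>As a rough sketch, the build-and-analyze portion of such a pipeline, stored as a declarative Jenkinsfile, might look like this. The Gradle tasks, server name, and timeout are hypothetical; the withSonarQubeEnv and waitForQualityGate steps are provided by the SonarQube Scanner plugin for Jenkins:</p>

```groovy
pipeline {
    agent any
    stages {
        stage('Build') {
            steps { sh './gradlew clean build' }  // compile and run unit tests
        }
        stage('SonarQube analysis') {
            steps {
                // 'sonar-server' must match the server name configured in Jenkins
                withSonarQubeEnv('sonar-server') {
                    sh './gradlew sonarqube'
                }
            }
        }
        stage('Quality gate') {
            steps {
                timeout(time: 10, unit: 'MINUTES') {
                    // aborts the pipeline if the SonarQube quality gate fails
                    waitForQualityGate abortPipeline: true
                }
            }
        }
    }
}
```

<p>Storing this file in the repository alongside the ETL task code also lets developers reuse it as a template for similar tasks.</p><p>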
Based on the results of this analysis, the pipeline either moves on to the next stage or stops, and the code is sent back to the developers for fixing errors.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------5.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 5. SonarQube UI</figcaption></figure><p>After analyzing the changes in the code of an ETL task, SonarQube generates a report on the quality of the code. This report shows where the tool found dead code, duplicate code, and potentially unsafe code. There's also an option to configure the tool by using specific rules that define code quality in your case.</p><h3 id="jmeter-for-assessing-the-data-processing-capabilities-of-an-etl-task">JMeter for assessing the data processing capabilities of an ETL task</h3><p>Besides checking the code quality of your ETL task, you also have to check how well it integrates into your data processing pipeline. In order to do that, you must create a load profile similar to what you expect in the node's regular operational conditions. You can also perform load testing with stress tests.</p><p>JMeter is an open-source tool that provides functionality for both integration and stress testing, so you can use it to assess the throughput capabilities of your system.</p><p>It integrates with Jenkins and can generate reports on the performance of the tested system. If you also use Apache Kafka to transfer data, reports can show you if there's a need to scale or reconfigure it. 
You can measure the throughput by adding debugging information into your data processing node or by sending analytical queries to your data storage.</p><p>It's recommended to make integration testing a separate stage in your CI/CD pipeline so you can identify any issues with the performance of your ETL task in conditions similar to those of your production environment.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------6.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 6. JMeter UI</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------7.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 7. Taurus, a command-line tool for JMeter</figcaption></figure><h3 id="identifying-potential-issues-found-at-different-stages-of-the-ci-cd-pipeline">Identifying potential issues found at different stages of the CI/CD pipeline</h3><p>A CI/CD pipeline developed according to our recommendations lets you get debugging information from the software development automation system. After a developer implements some part of the business logic into the ETL task code, they can publish the code changes to the dev branch in their organization's Git repository. Then the CI/CD pipeline starts automatically building, testing, and deploying the published code changes.</p><p>This pipeline makes it possible to quickly diagnose and localize infrastructure issues that hinder the performance of your ETL task. Below, we'll show you how to read a Jenkins report and, in case of a failed build, quickly figure out what the problem is.</p><p>For example, in Figure 4, which illustrates how a Jenkins CI/CD pipeline works, you can see how the pipeline consecutively moves through the following stages:</p><p><strong>1. 
Getting code from a repository</strong></p><p>Errors at this stage can be caused by unavailable GitLab services. These types of issues don't have anything to do with the code itself, so they should be escalated to your infrastructure support service and DevOps engineers.</p><p><strong>2. Building code</strong></p><p>At this stage, you can identify issues with the building process: errors in source code syntax, missing packages, incorrect project structure, and any other issues that a build tool (such as Gradle, sbt, or Maven) can find. These issues are logged in the build history, and you can view them via the Jenkins web interface.</p><p><strong>3. Starting static code analysis</strong></p><p><strong><em>Note:</em></strong> If the build tool starts static code analysis, then we recommend putting this stage after the build stage. You can also merge these two stages into one.</p><p>Errors at this stage can happen if static code analysis is performed before the build stage (see stage 2), and then issues arise at the build stage. If static code analysis comes after the build stage, issues may be caused by an unavailable SonarQube service. Just like at the first stage, such issues should be escalated to your infrastructure support service and DevOps engineers.</p><p><strong>4. Performing unit tests</strong></p><p>Errors at this stage are related to the business logic of your ETL task, so your ability to find these errors depends on the quality of your unit tests. Test results and errors are logged, and you can view them via the Jenkins web interface.</p><p><strong>5.
Assessing the results of the static code analysis using a quality gate configured in SonarQube</strong></p><p>At this stage, the static code analysis result is ready, and if the code doesn't satisfy your SonarQube quality requirements, you can view the result in the SonarQube web interface.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------8.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 8. A failed quality gate returning ERROR in Jenkins</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------9.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 9. The results of static code analysis performed on a build that contains errors</figcaption></figure><p><strong>6. Deploying the application into an Apache Spark test cluster</strong></p><p>Errors at this stage can be caused by improperly configured deployments or by incorrect information provided to the system. Another possible reason is that your Spark resources are unavailable. When developers can't fix such issues by themselves, the issues should be escalated to your infrastructure support service and DevOps engineers.</p><p><strong>7. Performing load tests on the ETL task</strong></p><p>Errors at this stage don't signify issues with the code — they usually indicate that the Apache Kafka cluster is unavailable. That's why, like in stages 1 and 3, such issues should be escalated to your infrastructure support service and DevOps engineers.</p><p>Only after all these stages have been successfully completed can you verify whether your ETL task is working as expected. You can assess its stability, compliance with the business logic, and capabilities by examining its performance logs in the Apache Spark web interface.
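</p><p>To illustrate what the unit tests from stage 4 might cover, here is a hypothetical pytest-style sketch; the business rule and its thresholds are invented for illustration:</p>

```python
# A hypothetical piece of ETL business logic: readings outside the sensor's
# physical operating range are treated as faulty and dropped.
def filter_valid(readings, low=-40.0, high=85.0):
    return [r for r in readings if low <= r <= high]

# Unit tests pinning down the boundary conditions (runnable with pytest,
# or directly as plain assertions).
def test_keeps_in_range_readings():
    assert filter_valid([0.0, 21.5, 84.9]) == [0.0, 21.5, 84.9]

def test_drops_out_of_range_readings():
    assert filter_valid([-50.0, 21.5, 120.0]) == [21.5]

def test_boundaries_are_inclusive():
    assert filter_valid([-40.0, 85.0]) == [-40.0, 85.0]
```

<p>Edge cases like the inclusive boundaries are exactly the conditions worth pinning down before development starts, as recommended earlier.</p><p>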
You can also add more stages for automatic business logic testing to your CI/CD pipeline if you need to. The pipeline that we suggest is a basic one and consists of the stages necessary for the development of any ETL task, so it doesn't include additional stages that might be required in your specific situation.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------10.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 10. ETL task performance logs showing a message with debugging information about successful delivery of all JMeter messages</figcaption></figure><h3 id="an-example-of-a-software-development-cycle-based-on-our-recommendations">An example of a software development cycle based on our recommendations</h3><p>With the infrastructure described above, companies can develop and test ETL tasks using a CI/CD pipeline. So, if we formalize our cycle for developing a stable ETL task, it will look like the flowchart below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------11.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 11. The ETL task development cycle</figcaption></figure><p>After you’ve decided what kind of ETL task you need, create tests for checking whether your ETL task works properly in terms of the business logic. If necessary, you can set up two build pipelines — one for a testing environment and one for the production environment. Use two branches in your GitLab repository: dev and main. Each branch is connected to the corresponding build pipeline: main to the one for the production environment, and dev to the one for the testing environment.</p><p>During development, developers commit code changes to the dev branch and, after they are automatically built, assess the code quality.
This way, developers get quick feedback about potential issues. If there are no errors found in a build, the author of these code changes can push them to the main branch (if they have permissions to do that in GitLab). After that, the ETL task is built and deployed in the production environment.</p><p>Let's go through this process from a developer's point of view.</p><p><strong>1. We change the source code of the ETL task (the testMethodForSonarQube method):</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------12.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 12. The section of the ETL task code that we're working on</figcaption></figure><p><strong>2. We commit our code changes to the dev branch:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------13.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 13. Committing code changes using the Git version control system</figcaption></figure><p><strong>3. After the build process is finished, we can see the results — in this case, the static code analysis tool found an error:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------14.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 14. Pipeline stage results in Jenkins showing an error found during static code analysis</figcaption></figure><p><strong>4. We look at the static code analysis report and see what exactly we need to fix and where it is:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------15.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 15. 
The report from the static code analysis tool</figcaption></figure><p><strong>5. After fixing the issue, we get a successful build. At this point, we can consider using this ETL task in the production environment:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2021/03/-------16.png" class="kg-image" alt="The ETL process development cycle"><figcaption>Figure 16. A successful build in a Jenkins CI/CD pipeline</figcaption></figure><h3 id="conclusion">Conclusion</h3><p>As we've shown above, the development of an ETL task can be simplified by using automation and a CI/CD pipeline for building, testing, and deploying code.</p><p>A CI/CD pipeline can be based on open-source software, which makes it possible to use these systems in highly secure segments of a corporate network. Some components of this architecture (Jenkins, GitLab, and SonarQube) also have built-in security mechanisms.</p><p>All the recommended components easily integrate with each other. This architecture can also be extended with other systems (for example, BI and/or ML tools).</p><p>Automation and the approach suggested by us can help companies transform their business processes and make them more flexible by improving code quality and accelerating their ETL task development cycles. 
By following our recommendations, organizations can quickly and routinely get even more information from new data sources.</p>]]></content:encoded></item><item><title><![CDATA[Why businesses want DevOps, and what DevOps engineers need to know to communicate with them effectively]]></title><description><![CDATA[Let's focus on the evolution of IT processes, what businesses want to gain from implementing DevOps, what that entails for DevOps engineers, and how we can bridge the gap between us.]]></description><link>http://devopsprodigy.com/blog/why-businesses-want-devops-and-what-devops-engineers-need-to-know-to-communicate-with-them-effectively/</link><guid isPermaLink="false">5faa1d17c1c49900013d1ba6</guid><category><![CDATA[DevOps]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Evgeny Potapov]]></dc:creator><pubDate>Tue, 10 Nov 2020 05:15:34 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/11/mtnpqthp0d4rlh3tjrv2znjhhhg.jpeg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/11/mtnpqthp0d4rlh3tjrv2znjhhhg.jpeg" alt="Why businesses want DevOps, and what DevOps engineers need to know to communicate with them effectively"><p>Over the last few years, we’ve taken every possible opportunity to talk about what DevOps is. It may seem like it’s getting tedious, but the fact that the DevOps discussion is still going on is quite telling: there are still unresolved issues. And these issues lie in communication between businesses and DevOps engineers.</p><p>I often see how people coming to DevOps from different backgrounds each have their own definition of DevOps and basically speak different languages. 
At a certain point, it turns out that the stakeholders of a DevOps transformation project don't understand each other, and don’t even understand why they need DevOps at all.</p><p>In this article, I'm not going to talk about what DevOps is and what's the right way to define it. Instead, I'll focus on the evolution of IT processes, what businesses want to gain from implementing DevOps, what that entails for DevOps engineers, and how we can bridge the gap between us.</p><p>Looking back at the journey we’ve gone through with our customers, I can see how business requirements have changed over the course of the last few years. We have been providing maintenance services for complex information systems since 2008, and at first our customers mostly wanted us to make their websites fault-tolerant. But now they have some radically different requests. Back then, the most important things were stability, scalability, and a resilient production environment. Now, fault tolerance of development platforms and deployment systems is of equal importance.</p><blockquote>Large companies have experienced a value shift: a stable development environment has become just as important as a stable production environment.</blockquote><p>To better understand how we got here and how our mentality has changed, let’s briefly go over the history of software development.</p><hr><h2 id="some-history-of-software-development">Some history of software development</h2><p>I think the evolution of software development principles can be roughly divided into four stages. An important thing to note is that <strong>software delivery has existed at all stages</strong>, although <em>who</em> was responsible for it has varied throughout its evolution.</p><p><strong><u>Mainframes: 1960s - 1980s</u></strong></p><p><strong>Characteristic features of this stage:</strong> From the point when computers had just appeared and up to about 1985, <strong>companies developed software exclusively for internal use</strong>. 
Small development teams delivered software with rather limited functionality, which was tailored to the demands of the specific company or government-related organization where it would be used. That limited functionality might be intended for sending people to the Moon, but in comparison to modern services, it didn't have many use cases.</p><p><strong>Users: The employees of the company where the software was developed.</strong> Back then, the number of software users was also very limited — for example, 3 Apollo astronauts, or 20 people who calculated a government budget, or 100 people who processed population census results.</p><p><strong>Software distribution: Via physical media and mainframes.</strong> Companies had to produce punched cards, then put them into a computer, and in about 10 minutes the program was ready for use. Technically, software delivery was the responsibility of the person who entered the data onto punched cards. If a developer made a mistake, fixing it took a lot of time because that required rewriting and debugging the code, producing new punched cards, and inserting them into machines. All that took days and meant that many people had simply wasted their time. So, mistakes had very negative consequences and sometimes could <strong>even result in a disaster</strong>.</p><p>At this stage, IT as a business didn’t really exist yet. Wikipedia lists only four software development companies founded in 1975. One of them was Microsoft, but back then it was a very small and niche company.</p><p><strong><u>PCs and OOP: 1980s - 1990s</u></strong></p><p>Things started to change approximately in 1985, when personal computers became quite common: Apple Computer, Inc. started manufacturing the Apple II in 1977, the IBM PC was released in 1981, and a bit earlier, DEC minicomputers had gained significant popularity.</p><p><strong>Characteristic features of this stage:</strong> Software development was turning into a business. 
The number of users had grown, and that made creating software for sale possible.</p><p>In 1979, for example, the first spreadsheet software, called VisiCalc, was introduced. It took on some calculation tasks previously performed by an accountant (this role has now been transferred to Excel). Before that, an accountant entered numbers into a big table on paper and performed calculations using different formulae. If an analyst asked what would change if the revenue in the third quarter were twice as high, the accountant had to change one value and perform the same calculation again — all on paper.</p><p><strong>Users: Other companies.</strong> VisiCalc completely transformed the computer industry. Now software was developed for the mass market instead of a specific group of users with specialized requirements. For example, economists and analysts started to buy computers in order to leverage electronic spreadsheets.</p><p>Because there were more potential users, and software could be sold to individuals as well as companies, developers had to figure out how to make their software work for a large user base and how to create such complex software in general.</p><p>The growing number of users made it necessary to expand functionality. That required expanding development teams as well — a dozen developers was no longer enough. Working on a complex software product required a 100- to 500-person team.</p><p>Interestingly enough, each stage has some key books that caused revolutionary paradigm shifts in IT. I think for that stage — when software development as a business began to take hold and development teams started growing in number — those books were <em>The Mythical Man-Month: Essays on Software Engineering</em> and <em>Design Patterns: Elements of Reusable Object-Oriented Software</em>. At this time, two things became clear. Firstly, if you increase the number of developers in a team by four times, it doesn't mean you'll get the result four times faster. 
Secondly, there are other possible solutions to the scaling problem.</p><p>A popular way to deal with the growing complexity of software was object-oriented programming. The idea was that if you took a large application, such as Microsoft Excel, and split it into separate objects, development teams could work on them independently of each other. By dividing the product into parts based on functional elements, you could scale and, as a result, accelerate the overall development of the product. Keep in mind that back then, <em>accelerating the development cycle usually meant reducing it to several years</em>.</p><blockquote>The reasoning behind OOP sounds a lot like the reasoning behind microservices. However, at that stage, we still packaged applications into a single file (an .exe file during the reign of MS-DOS and, later, Windows) and then delivered it to the user.</blockquote><p><strong>Software delivery: Via physical media.</strong> At that stage, when software started to be mass-produced, the delivery process consisted of writing your software to floppy disks, labeling them with stickers showing the software name, packing the floppy disks into boxes, and sending them to users in different countries. Also, the number of defective floppy disks needed to be kept to a minimum. 
After all, if we manufacture floppy disks in America, deliver them to Russia, and only then find out that half of them are defective, that means huge losses for the business, and our customers will leave us once and for all.</p><p><strong>The cost of a mistake: Customers would demand their money back and would never buy from the same company again</strong>, which might ruin the whole business.</p><p>The software development cycle was terribly long, because each stage lasted several months:</p><p>● planning — 12 months</p><p>● development — 24 months</p><p>● testing — 12 months</p><p>● delivery — 12 months</p><p>New software versions were released once every few years, so making mistakes in the code was unacceptable.</p><p>The main risk factor, however, was that <strong>you couldn't get any user feedback</strong> throughout the whole software development cycle.</p><p>Just imagine: we've got an idea for an IT product, so we do some tests and decide that users might like our product. But we can't really make sure that it will succeed! We can only write code for two years, then ask some nearby accountants (for example, in Redmond) to install our software (for example, a new version of Excel) and try it out. And that's all we can do to test our product. 
It might very well turn out that nobody wants it and we've wasted the whole two years.</p><p>Or it might turn out that people buy our software, but it’s buggy and doesn't work properly — and because at this stage applications are still physical products that come in boxes which you can bring back to the store, users can easily return our product and decide never to buy any software from us again.</p><p><strong><u>Agile: 2001 - 2008</u></strong></p><p>The next stage came with the adoption of the Internet by the masses in the 2000s.</p><p><strong>Characteristic features of this stage:</strong> IT businesses were moving to the Internet, but browsers couldn't do much yet.</p><p>Microsoft created Internet Explorer, which was provided to all Windows users for free. A huge number of people could now access the Internet. Nevertheless, Microsoft intentionally hadn't optimized Internet Explorer for using dynamic functionality in order to protect their software from competition — e.g., browser-based apps and Netscape (you can learn more about that by reading about the <a href="https://en.wikipedia.org/wiki/Browser_wars">browser wars</a>). So, the Internet was mostly used for downloading files, but that was enough to make businesses move there.</p><p><strong>Software delivery:</strong> Users could now get software distributions from the Internet.</p><p>This made it possible to <strong>release updates and new software versions much more frequently — once every few months</strong>. 
Companies didn't have to write software to floppy disks or CDs anymore, because users could download updates from the Internet, and developers could allow themselves to make more mistakes.</p><p><strong>The cost of a mistake:</strong> The risk for the business was not that high because users could install an update and keep using the software.</p><p>Agile emerged at about the same time, so this stage saw the release of certain books on agile software development and extreme programming that are still considered IT management 101: for example, <em>Extreme Programming Explained: Embrace Change</em>, as well as <em>Refactoring: Improving the Design of Existing Code</em> and <em>Test-Driven Development</em>.</p><blockquote>The main idea was that, because companies could now deliver software via the Internet, they could shorten the development cycle and release new versions once every six months.</blockquote><p>The software development cycle in the beginning of the 2000s looked somewhat like this:</p><p>● planning — 2 months</p><p>● development — 6–12 months</p><p>● testing — 1–3 months</p><p>● delivery — a few weeks.</p><p>For one thing, rigorous testing wasn't as important as before. Even if 10 percent of users could encounter bugs, it was easier to release a patch rather than spend a year on making sure that the software worked properly for absolutely everyone. This way, companies could also test their hypotheses faster (although in this case faster meant 6–12 months).</p><p>Moreover, by spending less on testing and thorough planning, companies could cut costs on these experiments. And experimenting became a key idea of the next stage.</p><p><strong><u>DevOps: 2009 - 2020</u></strong></p><p><strong>Characteristic features of this stage:</strong> Installing software is a thing of the past, and any software that needs to be installed is updated via the Internet. The Internet is everywhere. 
Social networks and entertainment apps that are accessed exclusively through the Internet are gaining popularity. We can now implement complex dynamic functionality that runs in a browser, so businesses take advantage of this opportunity.</p><p><strong>Software delivery:</strong> Via the cloud. In the previous stage, software was installed on a user's computer, so it had to be adjusted to that environment. Now, we can adjust it to a single computer — our server in the cloud. This is very convenient for us because we have full control over this computer and how our apps run on it. There might be some difficulties with rendering interfaces in a browser, but they aren't too much of a problem anymore in comparison to the issues of the past.</p><p>All of that helps us accelerate planning, implementation, and testing. Now, we don't have to be left in the dark for months or even years when it comes to knowing whether our project will make it or not, what functionality users want, and so on. <strong>Updating software is possible in almost real time.</strong></p><p>Still, in 2006 - 2008, software was developed using the same ideology — an application was regarded as a single entity. While it wasn’t an .exe file anymore, it was still closer to a <em>monolith</em> that consisted of several closely connected objects. Such software was too unwieldy to be quickly adapted to the changing market.</p><p>In order to solve this problem, the same people who brought us OOP suggested splitting applications as well, so that software would be made up of separate apps that communicated with each other. Then it would be possible to expand development teams even more, going from hundreds to thousands of team members, and <em>create new functionality continuously</em>. 
This would let companies experiment more, test hypotheses, adapt to the market requirements and the behavior of the competition, and keep growing their businesses.</p><p>In 2009, the world saw the first <a href="https://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr">presentation</a> on uniting Dev and Ops in order to deploy code 10+ times a day. This unity became one of the main values in software development. The development cycle looks completely different now:</p><p>● planning — a few weeks</p><p>● development — a few weeks</p><p>● testing — a few days</p><p>● delivery — a few minutes</p><p>We can almost immediately fix mistakes and quickly develop new hypotheses. This was also the stage where MVP, now a well-known term, was introduced.</p><p>While in the 1970s, developers had almost no room for mistakes, and software was practically immutable (you don't really need to change requirements every time you send astronauts to the Moon), now it is absolutely dynamic. Everyone expects to find bugs in their software, so we must provide IT support and have a team that makes sure that your system works properly regardless of the dynamic changes within it.</p><blockquote>In this new stage, for the first time in the history of IT, the software delivery role has truly become an IT job.</blockquote><p>Before that, the person responsible for software delivery wasn't considered an IT employee. From the 1970s to the 1980s, this job consisted of simply inserting punched cards into computers, while from the 1980s to the 1990s it was about negotiating with CD manufacturers and taking care of logistics. All of that had nothing to do with software development or system administration.</p><p>Those who aren't very familiar with DevOps often think that "DevOps engineer" is just a hip new name for an administrator who’s more involved with the developers. 
But in business (and in Wikipedia too), <strong>DevOps is a methodology that is applied in software development</strong>. However, it's not the definition that's the most important — it's what DevOps gives us. And DevOps as a methodology lets us adapt to the changing market ASAP and restructure the way we develop software.</p><p>If a business doesn't want to lag behind its competitors, it must get rid of long development cycles with monthly releases and adopt DevOps instead. DevOps transformation here means a complete shift to Agile, from development to deployment. And this is how software delivery becomes a part of the software development process and turns into an IT job.</p><p>Because software delivery is connected to handling servers and infrastructure, it seems that this job is better suited for someone with administrator experience. But in this case, we end up with communication problems between DevOps engineers and the business. This is especially true if we're talking about administrators who take part in the DevOps transformation and try to meet the needs of the business, which wants to be more flexible.</p><p>Most administrators responsible for fault tolerance have a mantra of "If it works, don't touch anything!" Although their company doesn't launch any rockets, they have the same mentality about stability. But in the new, dynamic environment of today's world, businesses want (regardless of potential malfunctions):</p><p>● to go from an idea to a deployed product in a minimum amount of time;</p><p>● to test a maximum number of hypotheses in a short time;</p><p>● to minimize the impact of errors on production.</p><p>Even if something crashes, it's not a problem — we can roll back, fix the cause of the problem, and deploy again. 
It's better to quickly evaluate our product's chances for success than to invest in something that won't be in demand.</p><blockquote>The approach to fault tolerance is changing: we no longer need the current version of our software to remain stable for a long time — we just need to reduce the impact that any errors in the current version might have on the performance of the whole system.</blockquote><p>Instead of making sure that every little bit of added code is stable, we should be able to quickly discard unstable code and go back to stable code. This, too, is about flexibility: the value is not in the stability of the code deployed in the infrastructure but in <strong>the capability of the infrastructure to be extremely flexible</strong>.</p><hr><h2 id="what-exactly-can-devops-engineers-do-to-help-businesses">What exactly can DevOps engineers do to help businesses?</h2><p>So, how can we better connect with a business and its values?</p><p>Since we're implementing DevOps to meet the needs of the business, we must know whether DevOps engineers do what the business needs them to do. In order to do that, we can implement the following metrics (taken from <a href="https://cloud.google.com/devops/state-of-devops/">DORA's State of DevOps Report</a>):</p><p>● <strong>Deployment frequency</strong> — how often you deploy code to your production environment or how often your end users get new releases of your product.</p><p>● <strong>Lead time for changes</strong> — how much time passes from committing code to the repository to deploying it in the production environment.</p><p>● <strong>Time to restore service</strong> — how long your service takes to recover from a failure or crash.</p><p>● <strong>Change failure rate</strong> — what percentage of deployments result in worse user experience and require fixing new issues, for example by performing rollbacks.</p><p>These metrics will help you evaluate how efficiently your company leverages DevOps. 
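</p><p>As a rough sketch of how these metrics can be computed (the deployment log and its field names below are invented for illustration; in practice the records would come from your CI/CD system's API):</p>

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log: (commit time, deploy time, deploy succeeded).
deployments = [
    (datetime(2021, 3, 1, 9, 0),  datetime(2021, 3, 1, 15, 0),  True),
    (datetime(2021, 3, 2, 10, 0), datetime(2021, 3, 3, 11, 0),  True),
    (datetime(2021, 3, 4, 14, 0), datetime(2021, 3, 4, 18, 30), False),
    (datetime(2021, 3, 5, 8, 0),  datetime(2021, 3, 5, 9, 0),   True),
]

period_days = 7

# Deployment frequency: deployments per day over the observed period.
deployment_frequency = len(deployments) / period_days

# Lead time for changes: median time from commit to production deploy.
lead_time = median(deploy - commit for commit, deploy, _ in deployments)

# Change failure rate: share of deployments that had to be rolled back or fixed.
change_failure_rate = sum(1 for *_, ok in deployments if not ok) / len(deployments)

print(f"Deployment frequency: {deployment_frequency:.2f}/day")
print(f"Median lead time: {lead_time}")
print(f"Change failure rate: {change_failure_rate:.0%}")
```

<p>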
More than that, DevOps engineers can use them to understand what steps to take in order to help the business.</p><p><strong><u>Deployment frequency</u></strong></p><p>Obviously, the more often a company deploys code, the closer it is to embracing the DevOps transformation. But frequent deployments can be scary — and a good DevOps engineer can help the business overcome these fears.</p><p><strong>Fear #1:</strong> We might deploy code that hasn't been tested properly, and then our production environment will crash under the load.</p><p><em>The job of the DevOps engineer:</em> Provide an easy way to roll back and help automate testing in the infrastructure.</p><p><strong>Fear #2:</strong> We might deploy new functionality that has bugs in it, and implementing it will change the data structure or the data itself so much that a rollback won't be possible.</p><p><em>The job of the DevOps engineer:</em> Cooperate with developers — help them with architectural decisions, suggest effective data migration methods, and so on.</p><p><strong>Fear #3:</strong> Deployments are complex and take a lot of time. (Note: Our experience tells us that Docker images that take 20 minutes to build are quite a common occurrence.)</p><p><em>The job of the DevOps engineer:</em> Find a way to make deployments and rollbacks fast and speed up the build process.</p><p><strong><u>Lead time for changes</u></strong></p><p>This metric is useful for managers as well — after all, it's a manager's job to organize a workflow where code written by developers is committed and deployed ASAP. But DevOps engineers can help with solving challenges in the organization of such a workflow.</p><p><strong>Problem #1:</strong> Too much time passes between creating and merging pull requests, for example because not only the pull requests themselves are reviewed, but the submitted reviews are reviewed as well. 
The root of the problem here lies, again, in the hesitation to deploy code.</p><p><em>The job of the DevOps engineer:</em> Together with the development manager, consider automatically merging pull requests.</p><p><strong>Problem #2:</strong> Manual testing takes too long.</p><p><em>The job of the DevOps engineer:</em> Help automate testing.</p><p><strong>Problem #3:</strong> The build process takes too long.</p><p><em>The job of the DevOps engineer:</em> Monitor how much time the build process takes and try to reduce it.</p><p>In order to do that, the DevOps engineer must, first of all, understand how software testing works, how to automate it, and how to integrate automated testing into the build process. Second, the DevOps engineer should break down the software deployment pipeline into its individual components and try to figure out which parts can be optimized in terms of speed. You can monitor the whole process, from committing code to deploying it in production, find out how long the build process takes and how long it takes for a pull request to be approved, and then, together with the manager and the developers, figure out where you can save time.</p><p><strong><u>Time to restore service</u></strong></p><p>This metric actually has more to do with SRE.</p><p><strong>Problem #1:</strong> Locating technical issues is very difficult.</p><p><em>The job of the DevOps engineer:</em> Ensure observability and, together with the developers, set up a monitoring infrastructure and configure the monitoring system to effectively inform you about the performance of your service.</p><p><strong>Problem #2:</strong> Currently, the infrastructure doesn't allow for easy rollbacks.</p><p><em>The job of the DevOps engineer:</em> Make the necessary changes to the infrastructure.</p><p><strong>Problem #3:</strong> Migration made performing rollbacks impossible.</p><p><em>The job of the DevOps engineer:</em> Teach developers best practices for fault tolerance as well as for data 
migration that enables easy rollbacks.</p><p><strong><u>Change failure rate</u></strong></p><p>This metric is also from the domain of management. However, here's a fun fact: failures happen more often if deployments are infrequent.</p><p>Unfortunately, I often see how companies decide to start a DevOps transformation and implement Kubernetes and GitOps, but all that doesn't have any impact on their release frequency. Their approach stays the same, so if developing a new version of their product took six months before, it still takes six months now. And when you're writing code that takes months to reach the production environment, such code is much more likely to fail than code that's deployed weekly. This mentality undermines the whole DevOps transformation — if a company wants to adopt DevOps but their development cycle takes six months, that's a big problem.</p><p>In this situation, the DevOps engineer must sound the alarm and try to explain to the business, once again, what DevOps is about and how the approach to software development has changed over the last few years.</p><hr><h2 id="making-expectations-reality">Making expectations reality</h2><p>DevOps engineers need to have a clear understanding of what the business needs and work on fulfilling these needs. Here’s what you should keep in mind:</p><p>● The stability of the current version isn't as important as, firstly, the stability of the infrastructure in general and, secondly, the ability to roll back to the previous version in case of a failure, isolate the issue, and fix it quickly.</p><p>● The stability of the development environment. Its efficiency is critically important, especially if there are hundreds of developers on the team. When developers have to stop working because of issues with the development environment, it's just as bad as downtime in a factory.</p><p>● Monitoring the software delivery process is now a part of monitoring the whole infrastructure. 
If something takes 20 minutes to deploy, try to accelerate it.</p><p>● Software delivery speed has become one of the key areas for improvement — ideally, you should have a highly efficient pipeline that works without a hitch.</p><p>● A convenient development environment is another key objective. If developers can use the environment without any troubles, they write code faster, deploy it more often, and the quality of the code is better overall.</p>]]></content:encoded></item><item><title><![CDATA[How we chose the right time series database for us: testing several TSDBs]]></title><description><![CDATA[Some might think that any recently written article about choosing a TSDB should essentially be just one sentence: "Use ClickHouse." But it's not that simple, actually.]]></description><link>http://devopsprodigy.com/blog/chose-the-right-time-series-database/</link><guid isPermaLink="false">5f9116c4c1c49900013d1b3f</guid><category><![CDATA[TSDB]]></category><category><![CDATA[Apache]]></category><category><![CDATA[Druid]]></category><category><![CDATA[Pinot]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[InfluxDB]]></category><category><![CDATA[Cassandra]]></category><category><![CDATA[Prometheus]]></category><dc:creator><![CDATA[Dmitry Chumak]]></dc:creator><pubDate>Thu, 22 Oct 2020 05:37:36 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/10/Clocks.png" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/10/Clocks.png" alt="How we chose the right time series database for us: testing several TSDBs"><p>Over the course of the last few years, time series databases have evolved from a rather uncommon tool (intended for special purposes and used either in big data projects or in open-source monitoring systems, in which case it was bound to specific solutions) into a more conventional one. 
Previously, if you wanted to store a lot of time series data, you had two options: resign yourself to deploying and supporting the complex monstrosity that is the Hadoop stack, or deal with protocols specific to each system.<br><br>Some might think that any recently written article about choosing a TSDB should essentially be just one sentence: "Use ClickHouse." But it's not that simple, actually.<br><br>Yes, ClickHouse is being actively developed, its user base is growing, and the project is supported by many enthusiastic developers. But could it be that the apparent success of ClickHouse has blinded us to the potential of other (perhaps more effective or reliable) solutions?<br><br>In the beginning of 2018, we started overhauling our own monitoring system, and we faced the issue of choosing an appropriate database for storing our data. Now I want to share with you how we made that choice.</p><hr><p><strong>Defining the requirements</strong><br>Let's start with a little preamble. Why did we need our own monitoring system and how did we set it up?<br><br>We started to provide IT support services in 2008. 
By 2010, it was obvious to us that aggregating data about the processes in a customer's infrastructure by using the solutions available back then had become too difficult (I mean Cacti, Zabbix, and Graphite, which had been released not too long before).<br><br>Our main requirements were:</p><p>● a single system with the capacity to support at first dozens and eventually hundreds of customers, at the same time providing centralized notification management;</p><p>● a flexible notification management system, allowing us to escalate notifications from one employee to another, adapt the system according to the team's schedule, and access a knowledge base;</p><p>● the capability to create detailed interactive charts (back then, Zabbix generated graphics in the form of images);</p><p>● long-term (one year and longer) storage of large volumes of data and a fast data fetching mechanism.</p><p>In this article I'll focus on the last requirement.<br><br>This is what we wanted from our storage:</p><p>● it must be fast;</p><p>● preferably with an SQL interface;</p><p>● must be stable, with an active user base and ongoing input from developers (we once had to maintain MemcacheDB, which had been practically abandoned, and MooseFS, where bugs were submitted in Chinese. We don't want to repeat that experience.);</p><p>● should belong to the CP category of the CAP theorem: it must provide Consistency (our data must always be up-to-date, because we don't want the notification management system to send alerts to all projects when it doesn't receive new data) and Partition Tolerance (we don't want any split-brain issues). Availability is not a priority if we enable active replication — in case of emergency, we can manually switch to the redundant system through code.</p><p>Surprisingly, the best option for us back then turned out to be MySQL. Our data structure was very simple: the server id, the counter id, the timestamp, and the value. 
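</p><p>A minimal sketch of a table with that shape (using SQLite here rather than MySQL for brevity; the table and column names are ours, not the actual production schema):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metric_values (
        server_id  INTEGER NOT NULL,
        counter_id INTEGER NOT NULL,
        ts         INTEGER NOT NULL,   -- Unix timestamp, one point per second
        value      REAL    NOT NULL,
        PRIMARY KEY (server_id, counter_id, ts)
    )
""")

# Insert a few sample points for one counter on one server.
points = [(1, 42, 1_600_000_000 + i, 0.5 * i) for i in range(5)]
conn.executemany("INSERT INTO metric_values VALUES (?, ?, ?, ?)", points)

# Fetch a time range for charting, the way the monitoring UI would.
rows = conn.execute(
    "SELECT ts, value FROM metric_values "
    "WHERE server_id = ? AND counter_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
    (1, 42, 1_600_000_000, 1_600_000_003),
).fetchall()
print(rows)  # four (timestamp, value) pairs
```

<p>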
We could fetch both hot and historical data quickly due to the large size of the buffer pool and the speed of SSDs.</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/10/image.png" class="kg-image" alt="How we chose the right time series database for us: testing several TSDBs"></figure><p>This way, fetching the two most recent weeks of data, detailed to the second, and fully rendering it took 200 ms. After we managed to achieve this, we were happy with our system for a while. <br><br>But as time went by, the volume of our data grew. By 2016, it amounted to tens of terabytes, so renting SSD storage became quite costly.<br><br>By that time, columnar databases had gained popularity, and we started considering that option. In this type of database, data is stored, obviously, by column, and as you can see in the picture below, our data tables included many duplicate values — if we used a columnar database, they would be compressed into a single item.</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/10/image-1.png" class="kg-image" alt="How we chose the right time series database for us: testing several TSDBs"></figure><p>However, the existing system was business-critical and stable, so we didn't feel inclined to experiment with other solutions.<br><br>It was probably at the Percona Live Conference 2017 in San Jose where the ClickHouse developers introduced their project to the world. At first glance, the system seemed to be production-ready, maintaining it was easy and didn't take much time, and operating it was simple as well. That’s why, starting in 2018, we began our migration to ClickHouse. 
However, many mature and established TSDBs had appeared by then, so we decided to take more time and compare all the options in order to make sure that ClickHouse was the best choice for us.<br><br>Besides the previously mentioned requirements, we now had some new ones:</p><p>● our new system should be at least as efficient as MySQL on the same amount of hardware;</p><p>● the storage should take up less space than that of the existing system;</p><p>● the DBMS should be easy to use;</p><p>● switching to the new system shouldn't require making too many changes to our monitoring application.</p><p>Here are the systems we were considering:</p><p><strong><em>Apache Hive/Apache Impala</em></strong><br>A part of the old tried-and-true Hadoop stack. Essentially, it’s an SQL interface on top of data stored in its own file formats in HDFS.<br><br><em>Advantages:</em></p><p>● If you maintain it properly, scaling is easy.</p><p>● You can store data in a columnar format, which saves space.</p><p>● Parallel tasks are finished very quickly provided that there are enough computing resources.</p><p><em>Disadvantages:</em></p><p>● Being a part of the Hadoop ecosystem results in operational difficulties. If you're not ready to buy a ready-made cloud solution (which we considered too expensive), your administrators will have to put the whole stack together and maintain it themselves. We didn't want that.</p><p>● Yes, <a href="https://www.percona.com/blog/2014/04/21/using-apache-hadoop-and-impala-together-with-mysql-for-data-analysis/">the speed of data aggregation is good</a>.</p><p>However:</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/10/image-2.png" class="kg-image" alt="How we chose the right time series database for us: testing several TSDBs"></figure><p>As you can see, this speed is achieved through adding more computing servers. 
In other words, if we were a large data analytics company, and a very high speed of data aggregation was a critical business requirement (even if we had to pay for a large amount of computing resources), Apache Hive/Apache Impala could be a viable option. But we weren't ready to drastically increase the amount of hardware for the sake of speed.<br><br><strong><em>Druid/Pinot</em></strong><br><br>This option is more along the lines of a time series database, but still associated with Hadoop. Here’s <a href="https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7">a great article comparing Druid, Pinot, and ClickHouse</a>.<br><br>In a nutshell, Druid/Pinot are better than ClickHouse when:</p><p>● Your data is heterogeneous (in our case, we only recorded time series of server metrics, which comprise, essentially, a single table. However, there might be other use cases where we need to aggregate and process different time series with different structures: time series of hardware metrics, economic time series, and so on).</p><p>● At the same time, there is a lot of data.</p><p>● Tables and data including time series are not permanent (meaning that data sets are added, analyzed, and then deleted).</p><p>● There are no clear-cut criteria for data partitioning.</p><p>If none of the above applies to your situation (as in our case), you'd be better off with ClickHouse.<br><br><strong><em>ClickHouse</em></strong></p><p>● An SQL-like syntax.</p><p>● Easy to use.</p><p>● People say that it works for them.</p><p>So, it became a shortlist candidate.<br><br><strong><em>InfluxDB</em></strong><br><br>This is an alternative to ClickHouse. 
One of the downsides is that only the paid version provides high availability functionality, but we included it in the comparison anyway.<br>It became a shortlist candidate.<br><br><strong><em>Cassandra</em></strong><br><br>On the one hand, we know that such monitoring systems as <a href="https://www.signalfx.com/blog/making-cassandra-perform-as-a-tsdb/">SignalFX</a> and OkMeter use it for storing time series of metrics. On the other hand, it has its specifics.<br><br>Cassandra isn't a conventional columnar TSDB. It looks more like a row-oriented database, but each row can have a different number of columns, so you can arrange your data to be represented in a columnar format. The 2 billion column limit allows for storing some data (like time series) specifically in columns. In MySQL the limit is 4096 columns, so if you try to do the same using MySQL, you're very likely to get a #1117 (Too many columns) error.<br><br>The Cassandra engine is intended for storing large volumes of data in a distributed system without a master node. This database falls under the AP category of the CAP theorem, so it’s oriented towards availability and partition tolerance. As a result, Cassandra is a great tool if you mostly need to write to your database and read data from it only on rare occasions. It would make most sense to use Cassandra as cold storage — that is, as reliable long-term storage for large sets of historical data that you don't utilize very often but can fetch when necessary. However, we included it in the test to make it more comprehensive. 
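The wide-row layout described above is easier to see in code. Here is a minimal pure-Python sketch of the idea (all names are hypothetical, and this is not the Cassandra driver API): a partition key picks the wide row, and the timestamp plays the role of the clustering column that orders the cells within it.

```python
from collections import defaultdict
from bisect import insort

class WideRowStore:
    """Toy illustration of Cassandra-style wide rows: each partition key
    holds its own ordered set of (timestamp, value) cells, so different
    rows can have a different number of columns."""

    def __init__(self):
        # partition key -> sorted list of (timestamp, value) cells
        self.rows = defaultdict(list)

    def insert(self, metric_id, day, timestamp, value):
        # (metric_id, day) selects the wide row; timestamp orders cells in it
        insort(self.rows[(metric_id, day)], (timestamp, value))

    def fetch_range(self, metric_id, day, t_from, t_to):
        # Range scans within one partition are cheap: cells are kept sorted
        return [(t, v) for t, v in self.rows[(metric_id, day)]
                if t_from <= t <= t_to]

store = WideRowStore()
for t in range(0, 60, 15):          # one metric, a value every 15 seconds
    store.insert("cpu.user", "2020-10-01", t, t * 2)

print(store.fetch_range("cpu.user", "2020-10-01", 15, 45))
# → [(15, 30), (30, 60), (45, 90)]
```

In actual CQL this corresponds to something like `PRIMARY KEY ((metric_id, day), ts)`, where each `(metric_id, day)` partition can hold a different number of cells, which is the property the 2-billion-column limit applies to.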
There were some limitations, because, as I mentioned before, we didn't want to rewrite too much code to accommodate the new database solution, so we didn't adapt the structure of our database to Cassandra.<br><br><strong><em>Prometheus</em></strong><br><br>Out of curiosity, we decided to test Prometheus as well, so that we could see whether existing solutions were faster or slower than our own monitoring system.</p><hr><p><strong>Our testing methodology and test results</strong></p><p>We tested five databases in six configurations: ClickHouse (1 node), ClickHouse (a distributed table stored across 3 nodes), InfluxDB, MySQL 8, Cassandra (3 nodes), and Prometheus. Our test plan ran as follows:</p><p>1. Upload historical data collected over a week (840 million values per day, 208 thousand metrics).</p><p>2. Generate a workload: continuously insert new data into the tested system (we tested the systems in six load modes, which you can see in the picture below).</p><p>3. Besides inserting data, fetch data from time to time in order to simulate the actions of a user working with charts. To make things simpler, we fetched data on ten metrics (the number of metrics in the CPU chart) collected over the course of a week.</p><p><br>So, we started inserting data, simulating the behavior of our monitoring agent, which sends values for each metric every 15 seconds. At the same time, we varied:</p><p>● the overall number of metrics being inserted;</p><p>● the time interval between sending data for the same metric;</p><p>● the batch size.</p><p>Now, a few words about the batch size: since individual inserts are not recommended for any of the tested systems, we need an intermediary that collects incoming metrics data, groups it, and inserts it into the database in batches.<br><br>To make data interpretation easier, we'll think of this bunch of metrics as if they were organized by servers — 125 metrics per server. 
We’ll do it simply to illustrate that, for example, 10,000 metrics would correspond to 80 servers.<br><br>So, keeping all that in mind, here are our six load modes for inserting data into the database:</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/10/image-3.png" class="kg-image" alt="How we chose the right time series database for us: testing several TSDBs"></figure><p>A couple of things to note. First of all, Cassandra couldn't handle these batch sizes, so we limited them to 50 and 100 metrics. Secondly, since Prometheus uses the pull model, that is, it pulls data from the sources of the metrics (and even Pushgateway doesn't change the situation, despite what its name might suggest), we used a combination of static configs when generating the workload.<br><br><strong>Here are the test results:</strong></p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/10/image-4.png" class="kg-image" alt="How we chose the right time series database for us: testing several TSDBs"></figure><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/10/image-5.png" class="kg-image" alt="How we chose the right time series database for us: testing several TSDBs"></figure><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/10/image-6.png" class="kg-image" alt="How we chose the right time series database for us: testing several TSDBs"></figure><p>As you can see, Prometheus can fetch data exceptionally quickly, Cassandra is horribly slow, and InfluxDB is way too slow. 
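Stepping back to the insertion setup for a moment: the batching intermediary described in the methodology above (the component that collects incoming metric values and inserts them into the database in batches) can be sketched roughly like this. It is a simplified illustration, not our actual agent, and `flush` stands in for the real bulk insert.

```python
class BatchInserter:
    """Collects incoming metric values and hands them to the database
    in batches instead of one insert per value (hypothetical sketch)."""

    def __init__(self, batch_size, flush):
        self.batch_size = batch_size   # e.g. 50-100 metrics for Cassandra
        self.flush = flush             # callable performing the real bulk insert
        self.buffer = []

    def add(self, metric, timestamp, value):
        self.buffer.append((metric, timestamp, value))
        if len(self.buffer) >= self.batch_size:
            self.flush(self.buffer)   # ship a full batch to the database
            self.buffer = []          # start collecting the next batch

batches = []
inserter = BatchInserter(batch_size=3, flush=batches.append)
for i in range(7):                     # 7 incoming values, batch size 3
    inserter.add(f"metric_{i}", i * 15, float(i))

print(len(batches))  # → 2
```

Seven values with a batch size of three produce two full flushes, with one value still buffered and waiting for the next batch.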
As for the insert speed, ClickHouse comes out on top. Prometheus falls outside this comparison, because it ingests data internally, so its insert speed can't be measured the same way as for the other systems.<br><br>All in all, ClickHouse and InfluxDB showed the best results, but building an InfluxDB cluster would require buying its Enterprise version, while ClickHouse is free. So, while many companies favor InfluxDB, we will stick to ClickHouse.<br></p>]]></content:encoded></item><item><title><![CDATA[How I Didn't Make It to London but Still Attended the London DevOps Enterprise Summit]]></title><description><![CDATA[Instead of sharing the contents of the presentations, I decided to write an article that would summarize the conclusions and hot trends, and also point you to sources with up-to-date information.]]></description><link>http://devopsprodigy.com/blog/how-i-didnt-make-it-to-london/</link><guid isPermaLink="false">5f17f18d8de63a000119994f</guid><category><![CDATA[conference]]></category><category><![CDATA[tips]]></category><category><![CDATA[DevOps]]></category><dc:creator><![CDATA[Evgeny Potapov]]></dc:creator><pubDate>Wed, 22 Jul 2020 08:12:38 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/07/u169pbttlepihruqh0kvdfqkfec.jpeg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/07/u169pbttlepihruqh0kvdfqkfec.jpeg" alt="How I Didn't Make It to London but Still Attended the London DevOps Enterprise Summit"><p>This year's DevOps Enterprise Summit was held between the 23rd and 25th of June in London — or rather "in London," because the event went virtual due to the pandemic. 
Online conferences have both negative and positive effects. On the one hand, networking takes a big hit; on the other, you can take a break between presentations without having to hear the buzz of the crowd, and you can dedicate a few days to the presentations that you are particularly interested in, which is very convenient. </p><blockquote>In a similar fashion, I recently attended two virtual meetups that were held on two consecutive days and formally took place in different cities hundreds of miles away from each other, and a few days later I participated in another meeting "in California." </blockquote><p>As you can see, physically attending those events would’ve been very difficult. Sure, you could always listen to the recordings, but the feeling that you dedicated a specific time slot to learning in itself makes learning easier. <br><br>Conference reports have become an established blog post type, and the desire for self-development keeps growing stronger while you're stuck at home. But there are so many conferences — how can you see everything you want to see?<br><br>That's why I thought that I should do a slightly different conference report. Instead of sharing the contents of the presentations, I decided to write an article that would summarize the conclusions and hot trends, and also point you to sources with up-to-date information.<br></p><hr><p><strong>What is the DevOps Enterprise Summit? How is it different from other conferences?</strong></p><p>Any article on a DevOps conference is supposed to start with a reflection on what DevOps is. But everyone is tired of that, so I'll try to get right to the point. <br><br>Even though there are many conflicting opinions on DevOps, most people would at least agree that "DevOps is a set of practices that make software delivery and infrastructure management easier." 
The difference in opinion comes from the fact that one group (usually administrators) says that "discussion of DevOps practices should be as practical as possible," while another (developers and, most often, development managers) considers those practices to be a "philosophy" and a "methodology."<br><br>Some might think that I'm oversimplifying, but, in a nutshell, applying administration methods and approaches to development changed the underlying ideology of the development process and made DevOps the new Agile. When we hear about a DevOps conference focused on management, what we get is a conference on changes in the development methodology. However, the DevOps Enterprise Summit is a conference that has something for administrators too, because it also includes more technical presentations.<br><br>The first DevOps Enterprise Summit (DOES) was held in 2014. It’s organized by a company called IT Revolution, whose founder, Gene Kim, you might know as a co-author of <em>The Phoenix Project</em>. IT Revolution is a publisher that launched a whole book series on DevOps and its connection to agile methodologies. But why does the DevOps Enterprise Summit have Enterprise in its name? This is an important point.<br><br>Implementing agile methodologies in a small company is easy: if you have only a few employees, chances are that you're already using these methodologies — you just don't call them agile. But if your company has thousands or tens of thousands of employees, it’s much more difficult. Such a company has a myriad of processes that were put in place to keep it running. We can imagine it as a big ship: when a ship sails from one continent to another, it performs its function. 
However, it can't quickly change direction or be rerouted — this will take time, especially for technological processes.</p><blockquote> A startup with a few employees can build a prototype of a product in several days, while a company with thousands of employees would be building the same product for years, unless it transforms. </blockquote><p>For example, Excel considers 1900 a leap year. This was intentionally implemented for backward compatibility with Lotus 1-2-3, which had this bug. But Lotus 1-2-3 was released in the 80s, so do we actually need that compatibility now? How appropriate is this approach of "we'll fix these bugs later" if you deploy software on, for example, a bank’s permanently offline servers? How can you stay flexible? That’s what DevOps Enterprise Summit 2020 was about.</p><hr><p><strong>The format of the event</strong></p><p>Today, all conferences are trying to go virtual, often with interesting results. Here's how DOES did it:</p><ul><li>Presentations are still grouped by tracks, and you can switch between them while watching the stream.</li><li>Presentations are pre-recorded in order to avoid technical issues with streaming, so while a presentation is being streamed, its speaker interacts with the audience through a Slack channel. I think this is what stood out to me the most in the organization of the conference. I have mixed feelings about this decision, because the presentations themselves are not interactive, but there’s more time for questions from the audience, and the speakers can answer more of them. 
However, I, for example, was mostly watching presentations and didn't want to switch between them and Slack, so I would only open it at the end of each presentation, when the time allotted to discussion had almost run out and the next speaker was about to start.</li><li>The interaction takes place in Slack, where each track has a dedicated Slack channel. There are also sponsor and networking channels.</li><li>There were many more networking activities than before. What I didn't like though was that you couldn't participate in them passively: at the Zoom meetups you had to introduce yourself, and at the round tables everyone who joined was actively discussing different topics. But sometimes you want to just take a look at what people are talking about and then, if you aren’t interested, leave.</li></ul><hr><p><strong>Takeaways</strong></p><ul><li>DOES focuses on the experiences of horse companies instead of unicorn companies: in other words, companies that proved their worth through hard work over many years (as opposed to market capitalization). Among the examples, you'll see a 100-year-old insurance company, the largest European delivery service, and other companies of this type. You might even think of the approach taken by DOES as an anti-hipster trend centering around the practical application of new methodologies and technologies (see the introductory presentation by Gene Kim).</li><li>A cornerstone of DevOps as a methodology is "scenius" — collective genius. 
Scenius is opposed to individual genius, and that's the point of uniting teams that were previously opposed to each other and establishing continuous interaction between them — with scenius, teams can advance to a higher level of problem solving (see the introductory presentation by Gene Kim, and also <a href="https://medium.com/@stevesargon/steal-like-devops-artist-1-you-dont-have-to-be-a-genius-or-googler-6e9f17c14fb5">this link</a>; or, if you prefer, you can just google "scenius").</li><li>Every second speaker was talking about a positive experience with creating a platform team as well as Platform Engineering in general, where a flexible microservices architecture runs on top of a platform that provides the main infrastructure services. One of the presentations from DOES on this topic was “DevOps Journey at adidas III: Exploring Data in the Cloud,” by Fernando Cornago, VP, Platform Engineering, and Daniel Eichten, VP, Enterprise Architecture, adidas. (To learn more about it, you can also read <a href="https://softwareengineeringdaily.com/2020/02/13/setting-the-stage-for-platform-engineering/">this article</a> or google "platform engineering" and "platform team.") By the way, it's quite amusing how established methods can be reintroduced as something new. A platform team is not just about engineers who create platforms; it’s also about sharing advice on how to operate them. A platform team is a team with a vision of how their platform should develop (platform advisory/community management/platform operation/platform evolution).</li><li>Every third speaker was talking about creating a DevOps Dojo, so let me elaborate on that a little. In the case of small companies or a small number of participating employees, meetups and workshops for sharing experience seem to be undoubtedly beneficial. </li></ul><blockquote>But how can a company with thousands or tens of thousands of employees do the same? This is the purpose of DevOps Dojos. 
</blockquote><p>Dojo is a Japanese word meaning a hall for practicing martial arts. The American retailer Target introduced a similar practice into DevOps methodology: the company systematically organized internal workshops, meetups, and conferences, where employees could share their positive and negative experiences, and which served as a safe environment for practicing their skills without being afraid of making mistakes. For more details, see <a href="https://youtu.be/1FMktLCYukQ">this YouTube video</a>.</p><ul><li>Speakers from Swiss Re (a 156-year-old Swiss insurance company) were talking about the successful adoption of DevOps methodologies in their company, using their new service as an example. They started their DevOps transformation as an internal startup for three specific products. The company played the role of an investor for the startup, which had its own CEO, management, platform, etc. </li></ul><blockquote>By the way, according to Gartner, 76% of companies undergoing DevOps transformations consider themselves to be less than halfway through their journey to DevOps. </blockquote><p>When starting a DevOps transformation, you also have to understand the risks. For many companies, implementing their DevOps initiatives takes five years or more, and in that time, many practices become outdated or irrelevant. The main issue here is that large companies use the waterfall model to implement an agile methodology, and this creates more problems along the way. Shifting to DevOps implies iteration. Another thing that can advance transformation is changes in a technology team that result in a new vision: speakers from Hermes Germany GmbH mentioned that their DevOps initiative was started by their new CIO. 
Interestingly, they were against the concept of internal startups, because such startups stay isolated and the company itself doesn't transform.</p><ul><li>Almost all presentations about successful transformation cases mentioned the importance of metrics that indicate how successful the transformation was. You can find more about these metrics in the <a href="https://services.google.com/fh/files/misc/state-of-devops-2019.pdf">State of DevOps report</a> and also in <a href="https://www.amazon.com/dp/B07B9F83WM/ref=dp-kindle-redirect?_encoding=UTF8&amp;btkr=1"><em>Accelerate</em></a> by Nicole Forsgren, Jez Humble, and Gene Kim.</li><li>As a tech geek, I was also particularly impressed by the “DevOps And Modernization 2.0 (CSG)” presentation by Scott Prugh. He shared how his company implemented DevOps methodologies in mainframe operation. I'd recommend waiting until the recording is released, because this presentation is amazing: the speaker was talking about migrating COBOL code and speeding up its deployment, rewriting 3.7 million lines of HLASM code in Java, and much more cool stuff! Conclusion: DevOps transformation is possible even for very complex infrastructures with tons of legacy technology.</li></ul><p>N.B. Two books were mentioned by the presenters and I strongly recommend reading them: <em><a href="https://www.goodreads.com/book/show/44135420-team-topologies?from_search=true&amp;from_srp=true&amp;qid=huu43SJj1v&amp;rank=1">Team Topologies</a></em> and <a href="https://www.goodreads.com/en/book/show/35747076"><em>Accelerate: Building and Scaling High-Performing Technology Organizations</em></a>.</p>]]></content:encoded></item><item><title><![CDATA[Monitoring Microservice Applications: An SRE's Perspective]]></title><description><![CDATA[Our past experiences helped us establish a set of practices aimed at minimizing risks in maintaining microservice applications. 
In this article, I'll share 10 of them (those that I consider to be the most important) and the contexts of their usage.]]></description><link>http://devopsprodigy.com/blog/monitoring-microservice-applications/</link><guid isPermaLink="false">5f04054a8de63a00011998e7</guid><category><![CDATA[SRE]]></category><category><![CDATA[microservices]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Evgeny Potapov]]></dc:creator><pubDate>Tue, 07 Jul 2020 05:29:19 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/07/photo_--------1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/07/photo_--------1.jpg" alt="Monitoring Microservice Applications: An SRE's Perspective"><p>Today, infrastructure is made up of many small apps that run under the control of a single app manager, which manages the number of apps, their updates, and resource requests. This system is not a result of administrators trying to make infrastructure management easier — it’s a reflection of the modern thinking paradigm in software development. To understand why we’re talking about microservice architecture as an ideology, let's go 30 years back. <br><br>At the end of the 80s/beginning of the 90s, as PCs grew in popularity, object-oriented programming became the answer to the increasing number of software programs. Before that, software was essentially utilities, but then, large software projects became more common, and it turned into big business. There was nothing extraordinary about teams with thousands of members developing tons of new functionality. </p><blockquote>Businesses had to figure out how to organize teamwork without creating a disastrous mess, and object-oriented programming was the answer. 
</blockquote><p>However, the release cycle was still slow.<br><br>A company would plan the release of their product (let's say, Microsoft Office 95) a few years in advance. When the development stage had been completed, you'd have to thoroughly test your software, since fixing bugs would be difficult after the end users installed your product. Then you'd send your binary code to a factory, which would make the necessary number of copies on CDs or floppy disks. They were packaged in cardboard boxes and delivered to stores all over the world, where users would buy them and then install them on their PCs. This is the main difference from what we have now.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2020/07/photo_-------.jpg" class="kg-image" alt="Monitoring Microservice Applications: An SRE's Perspective"><figcaption>Monitoring Microservices DevOpsProdigy</figcaption></figure><p>Starting from 2010, as fast updates became a requirement in large software projects and companies, microservice architecture was chosen as the solution for that challenge. We no longer need to install applications on users' computers — instead we essentially "install" them in our infrastructure, which helps us to deliver updates quickly. This way, we can update software as fast as possible, and that enables us to experiment and test hypotheses. </p><blockquote>Businesses need to create new functionality in order to retain and attract customers. They also need to experiment and figure out what makes customers pay more. Finally, businesses need to avoid lagging behind their competitors. So, a business might want to update its codebase dozens of times a day, and in theory you could do it even for one large application. </blockquote><p>But if you split it into smaller pieces, managing updates would be easier. 
This is to say that switching to microservices wasn't a result of businesses trying to make applications and infrastructure more stable: microservices play an important role in Agile development, and agile software is what businesses strive for.<br><br>What does agile mean? It means speed, ease of implementing changes, and the option to change your mind. What matters here is not a solid product, but rather the speed of delivering a product and the possibility to try out concepts quickly. After trying them out, however, companies would then allocate resources for creating a solid product based on those concepts. In practice, it doesn't happen that often — especially in small teams and growing businesses where the main goal is to keep developing the product. This results in technical debt, which can be exacerbated by the belief that "we can just leave it to Kubernetes."<br><br>But that's a dangerous attitude. Recently, I stumbled upon a great quote that illustrates both the advantages and horrors of using Kubernetes in an operating environment:</p><blockquote>"<a href="https://danlebrero.com/2018/11/20/how-to-do-java-jvm-heapdump-in-kubernetes/">Kubernetes is so awesome that one of our JVM containers has been periodically running out of memory for more than a year, and we just recently realized about it.</a>"</blockquote><p><br>Let's consider this carefully. Over the course of a year, an application was crashing because it was running out of memory, and the operations team didn't even notice. Does that mean that the application was mostly stable and working as intended? At first glance, this functionality is very useful: instead of sending an alert about a service crash so that the administrator would go and fix it manually, Kubernetes detects and restarts a crashed app on its own. This happened regularly during the year, and administrators didn't receive any alerts. 
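A basic guard against exactly this silent-restart trap is to alert when a container's restart counter grows too fast. The sketch below illustrates the rule logic in plain Python; in a Kubernetes setup the cumulative counter would typically come from the kube-state-metrics metric `kube_pod_container_status_restarts_total`, and the one-hour window and threshold here are purely illustrative.

```python
def restart_alerts(samples, window_s=3600, max_restarts=3):
    """samples: {container: [(unix_ts, cumulative_restart_count), ...]}.
    Returns containers whose restart counter grew by more than
    max_restarts within the last window_s seconds."""
    alerts = []
    for container, points in samples.items():
        points = sorted(points)
        latest_ts, latest_count = points[-1]
        # counts of all samples that fall inside the alerting window
        window = [c for ts, c in points if ts >= latest_ts - window_s]
        if window and latest_count - window[0] > max_restarts:
            alerts.append(container)
    return alerts

samples = {
    "billing-api": [(0, 0), (1800, 2), (3600, 7)],   # 7 restarts in an hour
    "frontend":    [(0, 0), (3600, 1)],              # one restart: fine
}
print(restart_alerts(samples))  # → ['billing-api']
```

The same rule is usually expressed directly in the monitoring system's query language; the point is simply that restart counters must be watched at all, so a crash loop can't hide behind automatic restarts for a year.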
I've also seen a project where a similar situation happened, and they only found out about it when they were generating a monthly report. </p><p>The reporting functionality was developed and deployed in production in order to help business users, but soon they started getting an HTTP 502 error in response to their requests — the app was crashing, the request wasn't processed properly, and then Kubernetes would restart the app. While the application was technically working, it was impossible to generate reports. The employees who were supposed to use that service preferred to create reports the old-fashioned way and didn't report the error (after all, the company needed those reports only once a month, so why bother anyone?), and the operations team didn't see the need to give high priority to a task that was monthly at best. But, as a result, all the resources spent on creating that functionality (business analysis, planning, and development) were actually wasted, and it became obvious only a year later. <br><br>Our past experiences helped us establish a set of practices aimed at minimizing risks in maintaining microservice applications. In this article, I'll share 10 of them (those that I consider to be the most important) and the contexts of their usage.</p><hr><p><strong>When service reboots are not monitored/not taken seriously</strong></p><p><strong>Example</strong><br>See above. The problem here is, at least, a user not getting needed data and, at most, a systemically failing function.<br><br><strong>What you can do</strong><br>Basic monitoring: monitor whether your services reboot at all. 
There’s probably no need to give high priority to a service that reboots once in three months, but if a service starts to reboot every five minutes, take note.<br>Extended monitoring: keep an eye on all services that have been rebooted even once and set up a process for creating tasks to analyze those reboots.</p><hr><p><strong>When service errors, like fatal errors or exceptions, are not monitored</strong></p><p><strong>Example </strong><br>An app doesn't crash but instead displays a stack trace of an exception to users (or sends it to another app via API). In this case, even if we monitor app reboots, we might miss situations where requests are processed incorrectly. <br><br><strong>What you can do</strong><br>You can aggregate app logs in a suitable tool and analyze them. You should look through the errors thoroughly, and if you find a critical error, we recommend assigning an alert to it and escalating the investigation.</p><hr><p><strong>When there’s no health-check endpoint or it doesn't do anything useful</strong></p><p><strong>Example </strong><br>Thankfully, nowadays, creating endpoints that return service metrics (ideally as OpenMetrics) so that they can be read (for example, by Prometheus), is practically a standard. However, with businesses pressuring developers for new functionality, oftentimes they don't want to spend time on designing metrics. As a result, quite often the only thing a service health check can do is to return "OK." If an app can provide some output to display on the screen, it would be considered "OK." But that's not how it should be. Even if an app can't connect to its database server, such a health check would still return "OK," and that false information would hinder the investigation of the issue.<br><br><strong>What you can do</strong><br>First of all, having a health-check endpoint for all services should become the norm in your company, if it hasn't already. 
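To contrast with the bare return-"OK" anti-pattern above, here is a minimal sketch of a health check that actually probes its dependencies. The probe names and callables are hypothetical; in a real service each probe would ping the database, the queue, and so on.

```python
def check_health(checks):
    """checks: {dependency_name: zero-arg callable returning True/False}.
    Runs every probe and reports overall status plus per-dependency detail,
    so a broken database can no longer hide behind a bare "OK"."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False   # a crashing probe counts as unhealthy
    status = "ok" if all(results.values()) else "degraded"
    return {"status": status, "checks": results}

# Hypothetical probes; real ones would issue a ping, a SELECT 1, etc.
report = check_health({
    "database": lambda: True,
    "queue":    lambda: False,   # e.g. connection refused
})
print(report)
# → {'status': 'degraded', 'checks': {'database': True, 'queue': False}}
```

Returning the per-dependency detail alongside the overall status means the endpoint is useful both to the orchestrator (which only needs ok/degraded) and to the engineer investigating the incident.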
Secondly, health checks should also check the health and availability of all systems critical to the functioning of the service (such as access to queues, databases, availability of other services, etc.).</p><hr><p><strong>When API response time and service interaction time are not monitored</strong></p><p><strong>Example </strong><br>These days, now that most parts of an application have turned into clients and servers interacting with each other, you have to know how quickly each service responds to API calls. If the time has increased, one lag will lead to another, and due to the domino effect, the whole response time of the app will increase accordingly. <br><br><strong>What you can do</strong><br>Use tracing. Jaeger is pretty much standard now, and there’s a great team working on OpenTracing (in a similar fashion to the development of OpenMetrics). In <a href="https://www.youtube.com/watch?v=fjYAU3jayVo">this talk</a> and also <a href="https://github.com/opentracing-contrib">here</a> you can find the API for your programming language, which can provide OpenTracing metrics on app response time and service interaction time so that you can add them to Prometheus.</p><hr><p><strong>A service means an app, and an app means consumption of memory and CPU (and sometimes disk) resources</strong></p><p><strong>Example &amp; What you can do</strong><br>I think it's quite obvious. Many companies don't monitor the performance of services themselves; namely, how much CPU, RAM, and (if measurable) disk resources they use. In general, you should include all standard metrics used when monitoring a server. So, besides monitoring the whole node, we must also monitor each service.</p><hr><p><strong>Monitoring for new services</strong></p><p><strong>Example </strong><br>This might sound odd, but it’s worth mentioning. 
When there are many development teams and even more services, and with the SRE being more focused on overseeing development, the operations team responsible for a specific cluster should monitor for new services in the cluster and receive notifications about them. You might have standards in place that define how to monitor a new service, its performance, and which metrics it should export, but when a new service appears, you still must verify the compliance with these standards.<br><br><strong>What you can do</strong><br>Set notifications for new services in your infrastructure.</p><hr><p><strong>Monitoring delivery time and other CI/CD metrics</strong></p><p><strong>Example </strong><br>This is another relatively recent issue. <br><br>The performance of an application is influenced by its deployment speed. Complex CI/CD processes, combined with a more complicated app build process and the process of building a container for delivery, make seemingly simple deployments not so simple (here’s <a href="http://devopsprodigy.com/blog/6-ways-to-build-docker-images-faster/">our article on that topic</a>).<br><br>One day, you might find that deploying a certain service takes 20 minutes instead of one.<br><br><strong>What you can do</strong><br>Monitor how long it takes to deliver each of your apps, from building to the moment they begin running in production. If the delivery time starts to increase, look into it.</p><hr><p><strong>Application performance monitoring and profiling</strong></p><p><strong>Example &amp; What you can do </strong><br>When you learn that there’s an issue with one of your services (for example, the response time is too long, the service is not available, etc.), you won't be too excited about taking a deep dive into the service and restarting it in an attempt to pinpoint the issue. In our experience, tracing an issue is easy if you have detailed data from APM.  
Issues rarely appear out of the blue; they’re often the result of minor glitches piling up, and APM can help you understand when it all started. Another thing you can do is learn how to use system-level profilers — thanks to the development of eBPF, there are many opportunities for that.</p><hr><p><strong>Monitoring security: WAF, Shodan, images and packages</strong></p><p><strong>Example </strong><br>Monitoring shouldn't be restricted to performance. It can also help with ensuring the security of your service:</p><ul><li>Start monitoring the results of executing "npm audit" (or equivalent commands) as part of your app's build process — you'll get alerts if there are any issues with the library versions that you use, and if that's the case, you can update them.</li><li>Using the Shodan API (Shodan finds open ports and databases on the Internet), check your IP addresses to make sure that you don't have any unexpected ports open and that your databases haven't been exposed.</li><li>If you use a WAF, set alerts for WAF events so that you can see any intentional intrusions and the attack vectors used by the intruder.</li></ul><hr><p><strong>A bonus tip: SREs, keep in mind that your app's response time doesn't equal your server's response time!</strong></p><p>We’re used to measuring a system's performance by its server's response time, but 80% of a modern app's logic is in the frontend. If you aren’t already measuring your app's response time as the time it takes to display a page, along with frontend page-load metrics, you should start doing that. Users don't care whether your server's response time is 200 or 400 milliseconds if your Angular- or React-based frontend takes 10 seconds to load the page. 
In general, I believe that performance optimization in the future will be focused on the frontend, or even emerge as its own new field.</p>]]></content:encoded></item><item><title><![CDATA[Six Ways to Build Docker Images Faster (Even in Seconds)]]></title><description><![CDATA[Building Docker images, which is a part of this process, might take up to a few dozen minutes. This is hardly acceptable. In this article, we'll dockerize a simple application, then use several methods for speeding up build time and consider their nuances.]]></description><link>http://devopsprodigy.com/blog/6-ways-to-build-docker-images-faster/</link><guid isPermaLink="false">5edf166c8de63a00011997d6</guid><category><![CDATA[docker]]></category><category><![CDATA[tips]]></category><category><![CDATA[development]]></category><dc:creator><![CDATA[Evgenii Finkelstein]]></dc:creator><pubDate>Fri, 12 Jun 2020 15:35:00 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/06/cake-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/06/cake-1.jpg" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"><p>Nowadays, complex orchestrators and CI/CD are essential for software development, but as a result, a feature has a long way to go from a commit to testing and delivery before it reaches production. </p><blockquote>Developers used to upload new files to a server via FTP, so deployment took a few seconds. But now we have to create a merge request and wait a long time for a feature to get to the users.</blockquote><p>Building Docker images, which is a part of this process, might take up to a few dozen minutes. This is hardly acceptable. 
In this article, we'll dockerize a simple application, then use several methods for speeding up build time and consider their nuances.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://devopsprodigy.com/blog/content/images/2020/06/cake.jpg" class="kg-image" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"><figcaption>Docker images</figcaption></figure><p>Our team successfully created and now supports several media websites. For example, among our projects are websites for such Russian media outlets as <a href="https://tass.com/">TASS</a>, <a href="https://thebell.io/">The Bell</a>, <a href="https://novayagazeta.ru/">Novaya Gazeta</a>, and <a href="https://republic.ru/">Republic</a>. Recently, we were deploying the <a href="https://reminder.media/">Reminder</a> website to production, and while we were adding new features and fixing old bugs, slow deployment became a big problem.</p><p>We use GitLab for deployment: we build Docker images, push them to our Container Registry and deploy our container images to production. Building Docker images is the longest process on this list. For example, it took 14 minutes to build each non-optimized backend image.</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/06/before.png" class="kg-image" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"></figure><p>Something had to be done about that. We decided to figure out why building Docker images took so long and how to fix the situation. As a result, we were able to reduce build time to 30 seconds!</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/06/after.png" class="kg-image" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"></figure><p>To make our example more or less universal, we will create an empty Angular application:</p><!--kg-card-begin: markdown--><p>ng n app</p>
<!--kg-card-end: markdown--><p>Let’s add PWA support to it (we are progressive, aren't we?):</p><!--kg-card-begin: markdown--><p>ng add @angular/pwa --project app</p>
<!--kg-card-end: markdown--><p>While all those tons of NPM packages are being downloaded, let's talk about what a Docker image is. Docker makes it possible to package applications and run them in an isolated environment called a container. Thanks to such isolation, you can run multiple containers on a single server at the same time. Unlike virtual machines, Docker containers run directly on the host kernel, so they are more lightweight. Before running a dockerized application, we build a Docker image into which we package everything needed for our application to function. A Docker image is like a snapshot of a file system. For example, let's take this Dockerfile:</p><!--kg-card-begin: markdown--><p>FROM node:12.16.2<br>
WORKDIR /app<br>
COPY . .<br>
RUN npm ci<br>
RUN npm run build --prod</p>
<!--kg-card-end: markdown--><p>A Dockerfile is a set of instructions. Docker executes these instructions step by step and saves the changes to the file system, adding them to the previous ones. Each command creates its own layer. A finished Docker image is all these layers combined together.<br></p><p>It's important to know that Docker caches each layer. If nothing has changed since the previous build, Docker will use the completed layers instead of executing the commands. The main increase in the build speed is due to effective cache usage, so we will focus on the speed of building Docker images with a ready cache when measuring build speed. Let’s go step by step:</p><p>1. First, we need to delete images locally so that previous runs do not affect the test.</p><!--kg-card-begin: markdown--><p>docker rmi $(docker images -q)</p>
<!--kg-card-end: markdown--><p>2. Next, let’s run our build for the first time.</p><!--kg-card-begin: markdown--><p>time docker build -t app .</p>
<!--kg-card-end: markdown--><p>3. Now we change src/index.html, simulating the work of a programmer.<br></p><p>4. Then we run the build for the second time.</p><!--kg-card-begin: markdown--><p>time docker build -t app .</p>
<!--kg-card-end: markdown--><p>If we set up our build environment correctly (see details below), Docker will already have a bunch of caches when starting to build images. Our goal is to learn how to utilize the cache so that the build process is performed as quickly as possible. Since we don't utilize the cache the first time we build our image, we can ignore how slow it was. In testing, we are interested in the second run of the build, when our caches are already warmed up and we are ready to bake the cake — build the image. However, applying some tips will affect the first build too.<br></p><p>Let’s put the Dockerfile described above in the project folder and run the build process. All listings are shortened for readability.</p><!--kg-card-begin: markdown--><p>$ time docker build -t app .<br>
Sending build context to Docker daemon 409MB<br>
Step 1/5 : FROM node:12.16.2<br>
Status: Downloaded newer image for node:12.16.2<br>
Step 2/5 : WORKDIR /app<br>
Step 3/5 : COPY . .<br>
Step 4/5 : RUN npm ci<br>
added 1357 packages in 22.47s<br>
Step 5/5 : RUN npm run build --prod<br>
Date: 2020-04-16T19:20:09.664Z - Hash: fffa0fddaa3425c55dd3 - Time: 37581ms<br>
Successfully built c8c279335f46<br>
Successfully tagged app:latest</p>
<p>real 5m4.541s<br>
user 0m0.000s<br>
sys 0m0.000s</p>
<!--kg-card-end: markdown--><p>Then we must change the contents of src/index.html and run it for the second time.</p><!--kg-card-begin: markdown--><p>$ time docker build -t app .<br>
Sending build context to Docker daemon 409MB<br>
Step 1/5 : FROM node:12.16.2<br>
Step 2/5 : WORKDIR /app<br>
---&gt; Using cache<br>
Step 3/5 : COPY . .<br>
Step 4/5 : RUN npm ci<br>
added 1357 packages in 22.47s<br>
Step 5/5 : RUN npm run build --prod<br>
Date: 2020-04-16T19:26:26.587Z - Hash: fffa0fddaa3425c55dd3 - Time: 37902ms<br>
Successfully built 79f335df92d3<br>
Successfully tagged app:latest</p>
<p>real 3m33.262s<br>
user 0m0.000s<br>
sys 0m0.000s</p>
<!--kg-card-end: markdown--><p>Now we execute the docker images command to see if our image has been successfully created:</p><!--kg-card-begin: markdown--><p>REPOSITORY      TAG          IMAGE ID           CREATED              SIZE<br>
app          latest   79f335df92d3   About a minute ago   1.74GB</p>
<!--kg-card-end: markdown--><p>Before the build process starts, Docker takes all the files in the current build context and sends them to the Docker daemon: Sending build context to Docker daemon 409MB. The build context is indicated by the last build command argument — in our case, it is a period (.), which means that Docker will take all files from the current folder. But 409MB is a lot, so let's think about how to fix this situation.</p><hr><p><strong>Reducing build context size</strong></p><p>There are two ways to reduce context size:</p><ol><li>We can put all the files needed for the build in a separate folder and point Docker to it. It's not always convenient.</li><li>We can exclude files that are not needed for the build by adding a .dockerignore file to the context directory:</li></ol><!--kg-card-begin: markdown--><p>.git<br>
/node_modules</p>
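<p>To get a feel for what .dockerignore saves, here is a rough, self-contained simulation (no Docker needed; the toy project layout and file sizes are invented for illustration). Docker tars up the context directory and sends it to the daemon, so we can compare archive sizes with and without the ignored paths:</p>

```shell
# Rough simulation of the build-context upload: Docker tars the context
# directory, so we compare archive sizes with and without the paths
# listed in .dockerignore. Everything here is a throwaway toy project.
set -e
dir=$(mktemp -d)
mkdir -p "$dir/src" "$dir/node_modules/dep" "$dir/.git"
printf '<h1>app</h1>' > "$dir/src/index.html"
head -c 1048576 /dev/zero > "$dir/node_modules/dep/blob.bin"  # ~1 MB of dependencies
head -c 262144 /dev/zero > "$dir/.git/pack.bin"               # ~256 KB of git history
full=$(tar -C "$dir" -cf - . | wc -c)
slim=$(tar -C "$dir" --exclude='./.git' --exclude='./node_modules' -cf - . | wc -c)
echo "full context: $full bytes"
echo "with ignores: $slim bytes"
rm -rf "$dir"
```

<p>On a real project the difference is far more dramatic: in our case, node_modules was most likely responsible for the bulk of the 409MB.</p>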
<!--kg-card-end: markdown--><p>Let's build our image again:</p><!--kg-card-begin: markdown--><p>$ time docker build -t app .<br>
Sending build context to Docker daemon 607.2kB<br>
Step 1/5 : FROM node:12.16.2<br>
Step 2/5 : WORKDIR /app<br>
---&gt; Using cache<br>
Step 3/5 : COPY . .<br>
Step 4/5 : RUN npm ci<br>
added 1357 packages in 22.47s<br>
Step 5/5 : RUN npm run build --prod<br>
Date: 2020-04-16T19:33:54.338Z - Hash: fffa0fddaa3425c55dd3 - Time: 37313ms<br>
Successfully built 4942f010792a<br>
Successfully tagged app:latest</p>
<p>real 1m47.763s<br>
user 0m0.000s<br>
sys 0m0.000s</p>
<!--kg-card-end: markdown--><p>Yes, 607.2KB is much better than 409MB. And we also reduced the image size from 1.74GB to 1.38GB:</p><!--kg-card-begin: markdown--><p>REPOSITORY   TAG      IMAGE ID       CREATED         SIZE<br>
app          latest   4942f010792a   3 minutes ago   1.38GB</p>
<!--kg-card-end: markdown--><p>Now we will try to reduce the image size even more.</p><hr><h2 id="using-alpine-linux">Using Alpine Linux<br></h2><p>Another way to keep your Docker image size down is to use a small parent image. A parent image is the image that our image is based on. The lowest layer is indicated with the FROM command in a Dockerfile. In our case, we’ve been using a Debian-based image with Node.js already installed. And it’s almost 1GB (a monstrous size!).</p><!--kg-card-begin: markdown--><p>$ docker images -a | grep node<br>
node 12.16.2 406aa3abbc6c 17 minutes ago 916MB</p>
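<p>If you are curious where those 916MB come from, docker history prints the size of every layer of an image (a quick sketch; the exact output depends on your Docker version):</p>

```shell
# Show the size of each layer of the parent image (newest layer first)
docker history --format "{{.Size}}\t{{.CreatedBy}}" node:12.16.2
```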
<!--kg-card-end: markdown--><p>We can dramatically reduce the image size if we use an image based on Alpine Linux. Alpine Linux is an extremely lightweight Linux distribution. An Alpine image for Node.js is only 88.5MB! So let's replace our big image with the smaller one:</p><!--kg-card-begin: markdown--><p>FROM node:12.16.2-alpine3.11<br>
RUN apk --no-cache --update --virtual build-dependencies add \<br>
python \<br>
make \<br>
g++<br>
WORKDIR /app<br>
COPY . .<br>
RUN npm ci<br>
RUN npm run build --prod</p>
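<p>A classic Alpine trick, by the way, is the --virtual group we used above: it lets you uninstall the whole toolchain in the same layer once the build is done, so the toolchain never bloats the image. A hedged sketch (not used in our measurements, and largely superseded by the multi-stage builds shown below):</p>

```dockerfile
FROM node:12.16.2-alpine3.11
WORKDIR /app
COPY . .
# Install the toolchain, build, then delete the toolchain in the same RUN,
# so the final layer does not carry python/make/g++.
RUN apk --no-cache --update --virtual build-dependencies add \
        python \
        make \
        g++ \
    && npm ci \
    && npm run build --prod \
    && apk del build-dependencies
```

<p>The downside is that combining everything into one RUN makes the layer cache coarser, which is exactly what the later sections optimize.</p>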
<!--kg-card-end: markdown--><p>We had to install a few extra packages that are needed to build our application. Yes, Angular won’t build without Python. ¯\(°_o)/¯<br></p><p>But it was worth it — we reduced the image size by 619MB:</p><!--kg-card-begin: markdown--><p>REPOSITORY   TAG      IMAGE ID       CREATED          SIZE<br>
app          latest   aa031edc315a   22 minutes ago   761MB</p>
<!--kg-card-end: markdown--><p>Let's go even further.</p><hr><h2 id="using-multi-stage-builds">Using multi-stage builds<br></h2><p>We will take from our image only what is actually needed in production. This is what we have now:</p><!--kg-card-begin: markdown--><p>$ docker run app ls -lah<br>
total 576K<br>
drwxr-xr-x 1 root root 4.0K Apr 16 19:54 .<br>
drwxr-xr-x 1 root root 4.0K Apr 16 20:00 ..<br>
-rwxr-xr-x 1 root root 19 Apr 17 2020 .dockerignore<br>
-rwxr-xr-x 1 root root 246 Apr 17 2020 .editorconfig<br>
-rwxr-xr-x 1 root root 631 Apr 17 2020 .gitignore<br>
-rwxr-xr-x 1 root root 181 Apr 17 2020 Dockerfile<br>
-rwxr-xr-x 1 root root 1020 Apr 17 2020 README.md<br>
-rwxr-xr-x 1 root root 3.6K Apr 17 2020 angular.json<br>
-rwxr-xr-x 1 root root 429 Apr 17 2020 browserslist<br>
drwxr-xr-x 3 root root 4.0K Apr 16 19:54 dist<br>
drwxr-xr-x 3 root root 4.0K Apr 17 2020 e2e<br>
-rwxr-xr-x 1 root root 1015 Apr 17 2020 karma.conf.js<br>
-rwxr-xr-x 1 root root 620 Apr 17 2020 ngsw-config.json<br>
drwxr-xr-x 1 root root 4.0K Apr 16 19:54 node_modules<br>
-rwxr-xr-x 1 root root 494.9K Apr 17 2020 package-lock.json<br>
-rwxr-xr-x 1 root root 1.3K Apr 17 2020 package.json<br>
drwxr-xr-x 5 root root 4.0K Apr 17 2020 src<br>
-rwxr-xr-x 1 root root 210 Apr 17 2020 tsconfig.app.json<br>
-rwxr-xr-x 1 root root 489 Apr 17 2020 tsconfig.json<br>
-rwxr-xr-x 1 root root 270 Apr 17 2020 tsconfig.spec.json<br>
-rwxr-xr-x 1 root root 1.9K Apr 17 2020 tslint.json</p>
<!--kg-card-end: markdown--><p>Using docker run app ls -lah, we ran a container based on our app image, executed the ls -lah command, and then the container exited.<br></p><p>For production, we need only the dist folder. We also need a way to serve these files. We could run some kind of Node.js HTTP server, but there is an easier way: we take an Nginx image and put the dist folder and the small config file below into it:</p><!--kg-card-begin: markdown--><p>server {<br>
listen 80 default_server;<br>
server_name localhost;<br>
charset utf-8;<br>
root /app/dist;</p>
<pre><code>location / {
    # SPA fallback: paths that don't match a real file are served index.html
    try_files $uri $uri/ /index.html;
}
</code></pre>
<p>}</p>
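<p>That try_files line is what keeps client-side routing working: any path that doesn't match a real file is answered with index.html. Once the container is running (we start it below with docker run -p8080:80 app), a quick smoke test might look like this (sketch):</p>

```shell
curl -fsS http://localhost:8080/           # the built index.html
curl -fsS http://localhost:8080/some/route # fallback: index.html again, HTTP 200
```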
<!--kg-card-end: markdown--><p>We'll do this by using multi-stage builds. Let's change our Dockerfile:</p><!--kg-card-begin: markdown--><p>FROM node:12.16.2-alpine3.11 as builder<br>
RUN apk --no-cache --update --virtual build-dependencies add \<br>
python \<br>
make \<br>
g++<br>
WORKDIR /app<br>
COPY . .<br>
RUN npm ci<br>
RUN npm run build --prod</p>
<p>FROM nginx:1.17.10-alpine<br>
WORKDIR /app<br>
RUN rm /etc/nginx/conf.d/default.conf<br>
COPY nginx/static.conf /etc/nginx/conf.d<br>
COPY --from=builder /app/dist/app .</p>
<!--kg-card-end: markdown--><p>Now we have two FROM commands, and each one begins its own stage of the build process. We named our first stage builder, and the second one starts the process of creating our final image. In the last step, we copy the artifact from the builder stage into our final Nginx image. The image size has been reduced significantly:</p><!--kg-card-begin: markdown--><p>REPOSITORY   TAG      IMAGE ID       CREATED          SIZE<br>
app          latest   2c6c5da07802   29 minutes ago   36MB</p>
<!--kg-card-end: markdown--><p>Let's run our image as a container and make sure everything works:</p><!--kg-card-begin: markdown--><p>docker run -p8080:80 app</p>
<!--kg-card-end: markdown--><p>With the -p8080:80 option, we forwarded port 8080 of our host machine to port 80 of the container in which Nginx is running. Now we open <a href="http://localhost:8080/">http://localhost:8080/</a> in a browser and see our application. It works!</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/06/app.png" class="kg-image" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"></figure><p>Reducing the size of the image from 1.74GB to 36MB greatly shortens the delivery time of an application into production. But let's get back to build time.</p><!--kg-card-begin: markdown--><p>$ time docker build -t app .<br>
Sending build context to Docker daemon 608.8kB<br>
Step 1/11 : FROM node:12.16.2-alpine3.11 as builder<br>
Step 2/11 : RUN apk --no-cache --update --virtual build-dependencies add python make g++<br>
---&gt; Using cache<br>
Step 3/11 : WORKDIR /app<br>
---&gt; Using cache<br>
Step 4/11 : COPY . .<br>
Step 5/11 : RUN npm ci<br>
added 1357 packages in 47.338s<br>
Step 6/11 : RUN npm run build --prod<br>
Date: 2020-04-16T21:16:03.899Z - Hash: fffa0fddaa3425c55dd3 - Time: 39948ms<br>
---&gt; 27f1479221e4<br>
Step 7/11 : FROM nginx:stable-alpine<br>
Step 8/11 : WORKDIR /app<br>
---&gt; Using cache<br>
Step 9/11 : RUN rm /etc/nginx/conf.d/default.conf<br>
---&gt; Using cache<br>
Step 10/11 : COPY nginx/static.conf /etc/nginx/conf.d<br>
---&gt; Using cache<br>
Step 11/11 : COPY --from=builder /app/dist/app .<br>
Successfully built d201471c91ad<br>
Successfully tagged app:latest</p>
<p>real 2m17.700s<br>
user 0m0.000s<br>
sys 0m0.000s</p>
<!--kg-card-end: markdown--><h2 id="changing-the-order-of-layers">Changing the order of layers <br></h2><p>Docker cached the first three steps (Using cache). In the fourth step, all project files are copied, and in the fifth step, npm ci installs dependencies, which takes a whole 47.338 seconds. Why re-install dependencies on every build if they change very rarely? Let's see why they weren't cached. The thing is that Docker checks, layer by layer, whether the command and the files associated with it have changed. In the fourth step, we copy all the files of our project, and some of them have changed. This is why Docker can't use the cached version of this layer, and it invalidates all the following layers as well! Let's make a few changes to our Dockerfile.</p><!--kg-card-begin: markdown--><p>FROM node:12.16.2-alpine3.11 as builder<br>
RUN apk --no-cache --update --virtual build-dependencies add \<br>
python \<br>
make \<br>
g++<br>
WORKDIR /app<br>
COPY package*.json ./<br>
RUN npm ci<br>
COPY . .<br>
RUN npm run build --prod</p>
<p>FROM nginx:1.17.10-alpine<br>
WORKDIR /app<br>
RUN rm /etc/nginx/conf.d/default.conf<br>
COPY nginx/static.conf /etc/nginx/conf.d<br>
COPY --from=builder /app/dist/app .</p>
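<p>Why does this ordering help? Docker decides whether a COPY layer can come from the cache by checksumming the files being copied. A toy shell illustration of that rule (no Docker needed; the file names and contents are invented):</p>

```shell
# Toy illustration of Docker's cache rule: a COPY layer is reused only
# while the checksums of the copied files are unchanged. Editing a source
# file leaves package.json's checksum, and so the cached `npm ci` layer, intact.
set -e
dir=$(mktemp -d)
printf '{"name":"app"}' > "$dir/package.json"
printf '<h1>v1</h1>' > "$dir/index.html"
before=$(cksum "$dir/package.json" | cut -d' ' -f1)
printf '<h1>v2</h1>' > "$dir/index.html"   # the programmer edits a source file
after=$(cksum "$dir/package.json" | cut -d' ' -f1)
[ "$before" = "$after" ] && echo "package.json unchanged: npm ci layer reusable"
rm -rf "$dir"
```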
<!--kg-card-end: markdown--><p>First, package.json and package-lock.json are copied, then dependencies are installed, and only after that the whole project is copied. As a result:</p><!--kg-card-begin: markdown--><p>$ time docker build -t app .<br>
Sending build context to Docker daemon 608.8kB<br>
Step 1/12 : FROM node:12.16.2-alpine3.11 as builder<br>
Step 2/12 : RUN apk --no-cache --update --virtual build-dependencies add python make g++<br>
---&gt; Using cache<br>
Step 3/12 : WORKDIR /app<br>
---&gt; Using cache<br>
Step 4/12 : COPY package*.json ./<br>
---&gt; Using cache<br>
Step 5/12 : RUN npm ci<br>
---&gt; Using cache<br>
Step 6/12 : COPY . .<br>
Step 7/12 : RUN npm run build --prod<br>
Date: 2020-04-16T21:29:44.770Z - Hash: fffa0fddaa3425c55dd3 - Time: 38287ms<br>
---&gt; 1b9448c73558<br>
Step 8/12 : FROM nginx:stable-alpine<br>
Step 9/12 : WORKDIR /app<br>
---&gt; Using cache<br>
Step 10/12 : RUN rm /etc/nginx/conf.d/default.conf<br>
---&gt; Using cache<br>
Step 11/12 : COPY nginx/static.conf /etc/nginx/conf.d<br>
---&gt; Using cache<br>
Step 12/12 : COPY --from=builder /app/dist/app .<br>
Successfully built a44dd7c217c3<br>
Successfully tagged app:latest</p>
<p>real 0m46.497s<br>
user 0m0.000s<br>
sys 0m0.000s</p>
<!--kg-card-end: markdown--><p>The process took 46 seconds instead of more than two minutes, which is much better! It is important to arrange layers in the correct order: first copy the layers that never change, then those that change rarely, and finally those that change often.<br></p><p>Next, let’s talk a little about building Docker images in CI/CD systems.</p><hr><h2 id="using-previous-images-as-a-cache-source">Using previous images as a cache source<br></h2><p>If we use some kind of SaaS solution for building our Docker images, the local Docker cache may be completely empty. We need to give Docker an image from a previous build so that it can reuse its ready-made layers.<br></p><p>For example, let’s consider building our applications in GitHub Actions. We will use the following config file:</p><!--kg-card-begin: markdown--><p>on:<br>
  push:<br>
    branches:<br>
      - master</p>
<p>name: Test docker build</p>
<p>jobs:<br>
  deploy:<br>
    name: Build<br>
    runs-on: ubuntu-latest<br>
    env:<br>
      IMAGE_NAME: docker.pkg.github.com/${{ github.repository }}/app<br>
      IMAGE_TAG: ${{ github.sha }}</p>
<pre><code>steps:
- name: Checkout
  uses: actions/checkout@v2

- name: Login to GitHub Packages
  env:
    TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    docker login docker.pkg.github.com -u $GITHUB_ACTOR -p $TOKEN

- name: Build
  run: |
    docker build \
      -t $IMAGE_NAME:$IMAGE_TAG \
      -t $IMAGE_NAME:latest \
      .

- name: Push image to GitHub Packages
  run: |
    docker push $IMAGE_NAME:latest
    docker push $IMAGE_NAME:$IMAGE_TAG

- name: Logout
  run: |
    docker logout docker.pkg.github.com</code></pre>
<!--kg-card-end: markdown--><p>Building and pushing our image to GitHub Packages took 2 minutes and 20 seconds:</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/06/actions_wo_cache.png" class="kg-image" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"></figure><p>Now we’ll change the build configuration so that Docker will use the cached layers from the previous steps:</p><!--kg-card-begin: markdown--><p>on:<br>
  push:<br>
    branches:<br>
      - master</p>
<p>name: Test docker build</p>
<p>jobs:<br>
  deploy:<br>
    name: Build<br>
    runs-on: ubuntu-latest<br>
    env:<br>
      IMAGE_NAME: docker.pkg.github.com/${{ github.repository }}/app<br>
      IMAGE_TAG: ${{ github.sha }}</p>
<pre><code>steps:
- name: Checkout
  uses: actions/checkout@v2

- name: Login to GitHub Packages
  env:
    TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    docker login docker.pkg.github.com -u $GITHUB_ACTOR -p $TOKEN

- name: Pull latest images
  run: |
    docker pull $IMAGE_NAME:latest || true
    docker pull $IMAGE_NAME-builder-stage:latest || true

- name: Images list
  run: |
    docker images

- name: Build
  run: |
    docker build \
      --target builder \
      --cache-from $IMAGE_NAME-builder-stage:latest \
      -t $IMAGE_NAME-builder-stage \
      .
    docker build \
      --cache-from $IMAGE_NAME-builder-stage:latest \
      --cache-from $IMAGE_NAME:latest \
      -t $IMAGE_NAME:$IMAGE_TAG \
      -t $IMAGE_NAME:latest \
      .

- name: Push image to GitHub Packages
  run: |
    docker push $IMAGE_NAME-builder-stage:latest
    docker push $IMAGE_NAME:latest
    docker push $IMAGE_NAME:$IMAGE_TAG

- name: Logout
  run: |
    docker logout docker.pkg.github.com</code></pre>
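<p>A side note: if your runners build with BuildKit, --cache-from only works when the pushed images carry inline cache metadata, so you would add a build argument for that (a sketch reusing the same image names as above; check the documentation for your Docker version):</p>

```shell
export DOCKER_BUILDKIT=1
docker build \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --cache-from $IMAGE_NAME:latest \
  -t $IMAGE_NAME:latest \
  .
```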
<!--kg-card-end: markdown--><p>Here we must explain why we need two build commands. The thing is that in multi-stage builds, the resulting image is a set of layers from the last stage. The layers of the previous stages are not included in the image. Therefore, when using the final image from the previous build, Docker will not be able to find ready layers for building an image with Node.js (the builder stage). In order to solve this problem, we create an intermediate image, $IMAGE_NAME-builder-stage, and send it to GitHub Packages so that it can serve as a cache source for the subsequent build.</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/06/actions_w_cache.png" class="kg-image" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"></figure><p>The total build time was reduced to 1.5 minutes. Half a minute was spent on pulling the previous images.</p><hr><h2 id="using-pre-built-docker-images">Using pre-built Docker images<br></h2><p>Another way to solve the problem of an empty Docker cache is to move some of the layers to another Dockerfile, build that image separately, push it to the Container Registry, and use it as a parent image.<br></p><p>Let's create our Node.js image for building an Angular application. First, we create a Dockerfile.node in our project:</p><!--kg-card-begin: markdown--><p>FROM node:12.16.2-alpine3.11<br>
RUN apk --no-cache --update --virtual build-dependencies add <br>
python <br>
make <br>
g++</p>
<!--kg-card-end: markdown--><p>Then we build and push our public image to Docker Hub:</p><!--kg-card-begin: markdown--><p>docker build -t exsmund/node-for-angular -f Dockerfile.node .<br>
docker push exsmund/node-for-angular:latest</p>
<!--kg-card-end: markdown--><p>Now we use the finished image in our main Dockerfile:</p><!--kg-card-begin: markdown--><p>FROM exsmund/node-for-angular:latest as builder<br>
...</p>
<!--kg-card-end: markdown--><p>In this example, the build time has not decreased, but pre-built images can be useful if you have many projects and you have to put the same dependencies in all of them.</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2020/06/actions_node.png" class="kg-image" alt="Six Ways to Build Docker Images Faster (Even in Seconds)"></figure><p>In this article, we’ve examined several methods to speed up building Docker images. If you want to deploy faster, you can try:<br></p><ul><li>reducing context;</li><li>using small parent images;</li><li>using multi-stage builds;</li><li>reordering commands in your Dockerfile in order to utilize the cache efficiently;</li><li>configuring a cache source in CI/CD systems;</li><li>using pre-built Docker images.<br></li></ul><p>I hope these examples clarify how Docker works, and you will be able to optimize your deployment using my tips. If you want to experiment with the examples from this article, here is the link for the repository: <a href="https://github.com/devopsprodigy/test-docker-build">https://github.com/devopsprodigy/test-docker-build</a>.<br></p>]]></content:encoded></item><item><title><![CDATA[Kubernetes, Microservices, CI/CD and Docker: Learning Tips for Old School People]]></title><description><![CDATA[If you don’t want to get stuck in the past, you need to learn new IT trends. 
Yes, it takes a lot of time, but I think I can help you by sharing my learning experience, step by step.]]></description><link>http://devopsprodigy.com/blog/learning-tips/</link><guid isPermaLink="false">5ed5e5278de63a000119978a</guid><category><![CDATA[k8s]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Evgeny Potapov]]></dc:creator><pubDate>Tue, 02 Jun 2020 05:52:05 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/06/ITSumma-Library.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/06/ITSumma-Library.jpg" alt="Kubernetes, Microservices, CI/CD and Docker: Learning Tips for Old School People"><p>It seems some people are fed up with the topic of why we need Kubernetes. You might say that everyone who needs it has long understood the importance of Kubernetes, but I would divide IT specialists, IT managers, etc. into two categories: those who understand Kubernetes and know how to use it, and those who understand its importance but wonder how to fill in the gaps in their knowledge.</p><p><br>Perhaps you are a manager who has been working with the same technology stack for the past 10 years; or you are a developer who supports an old product or writes in the same old language in the same old environment. </p><blockquote>Maybe you just switched from technical to organizational management and suddenly found out that your skills are no longer relevant, and you want to find a relatively simple learning scenario to catch up. </blockquote><p>Having worked in organizational management, I know that it can be difficult to keep up with trends in IT culture, and one might start saying “Kubernetes is effective, and we must implement it” like an incantation, not quite understanding what it means. 
That’s why I will try my best to give you advice based on my own experience.<br></p><p>Why do I think it’s important to be able to change the paradigm of technological thinking?<br></p><p>The most difficult thing for those who have been working in IT for a long time is to accept that some new trends are there to stay. Over 20 years of working in IT, I’ve seen how various technologies appeared and disappeared, and some of them were “super relevant” only for a few months.<br></p><p>Joel Spolsky wrote that Microsoft systematically creates new stacks for developers in order to prevent them from considering other technologies. As an SRE, I was doubly suspicious of each new technology, since everything new is raw, and everything raw is unstable. All unstable things lead to problems in production, and production stability is the most important thing.<br></p><p>As a programmer and entrepreneur, I wanted to develop my product faster. However, since I had to learn all those new technologies and change my usual approach to development, it took me longer to roll out new features. While some new technologies were easy to apply, others related to microservice-oriented development (that's how I'll be referring to the whole current stack) required more thorough examination. Every year you have to spend more and more time studying, so it’s much easier to write programs in the good old way and deliver your product faster.</p><p><br>But the fact remains that sometimes new technologies stay and completely change the whole paradigm. In this case, you either remain true to the old paradigm or move on to the new one. 
COBOL programmers can still get a job, Perl developers can expect to work for <a href="https://www.booking.com/">booking.com</a>, but there are fewer and fewer job opportunities for them. Eventually, commitment to the old ways in the name of stability becomes a hindrance. If you don’t want to limit your options, start researching the current technology stack now so that you don’t fall even further behind. If you don’t want to get stuck in the past with Perl, you need to learn new things. Yes, it takes a lot of time, but I think I can help you by sharing my learning experience, step by step.</p><hr><h2 id="things-to-research-understand-and-accept">Things to research, understand and accept<br></h2><p><strong>First, you need to understand how to run applications in Docker containers.</strong> Old school people should understand that the way we store and run applications has changed forever. A new developer most likely has no idea how to run an application in production without Docker. They probably don't even think of storing files locally, except in rare cases with shared storage. However, IT veterans need to concede that the Docker container is the new EXE. Although EXE was just one of the executable file types and it has its drawbacks, it was the only way to run an application — just like Docker containers now.<br><br>Yes, the microservice architecture has also become the standard, like object-oriented programming in its time. OOP was created in order to make it easier for large teams to develop large software projects; now microservices serve the same purpose. The same people are behind both of these ideas too (see <a href="https://martinfowler.com/articles/microservices.html">Fowler</a>). 
This is reasonable: if the API is versioned, it is easier for self-contained teams to write small independent applications than a large monolithic one. It is worth debating whether small projects need microservices too, but just as everyone at some point started writing in OOP style simply because it was familiar (see the EXE point above), microservices are now simply what everyone writes. Of course, interprocess communication (especially if it uses the TCP stack) has some disadvantages in terms of performance (one application connects to another via TCP instead of just calling a function which would accomplish the same goal — can you imagine the difference in throughput performance?), but the fact remains that microservices allow us to develop medium and large projects faster and, moreover, they have become the standard. You also have to understand how microservices interact with each other (HTTP API, gRPC, asynchronous communication with queues, etc.).<br></p><p>Optionally, you can also learn more about service mesh. (First they start dividing applications, then they realize that service-to-service communication is so darn complicated! So they add an extra layer in order to fix that mess. Why, just why?)<br></p><p><strong>Understand how to manage a stack of microservices running in Docker containers.</strong> So, we have resigned ourselves to running applications in Docker containers and breaking up an application into microservices. Now we need to somehow manage our running Docker containers. You can do it yourself on dedicated servers (for example, with Docker Swarm, or by setting up Kubernetes yourself), or you can use cloud provider services such as those from AWS.<br></p><blockquote>There is one very big advantage to using cloud environments: you don’t have to think about the layer below your container manager.</blockquote><p>(SREs are probably laughing right now, but we all know that we do not tinker with GKE nodes when things are stable.) 
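<p>To make the point about inter-service communication concrete, here is a minimal, self-contained sketch (the price service and its endpoint are invented for illustration) of what happens when a plain function call becomes an HTTP round trip between two "services":</p>

```typescript
import * as http from 'http';
import { AddressInfo } from 'net';

// In a monolith, this would be a direct in-process function call.
function getPrice(sku: string): number {
  return sku === 'apple' ? 42 : 0;
}

// In a microservice world, the same logic sits behind an HTTP endpoint...
const priceService = http.createServer((req, res) => {
  const sku = (req.url || '/').slice(1); // e.g. GET /apple
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ price: getPrice(sku) }));
});

priceService.listen(0, () => {
  const port = (priceService.address() as AddressInfo).port;
  // ...and the "caller" becomes a network client: a whole TCP round trip
  // to do what `getPrice('apple')` does in-process.
  http.get({ port, path: '/apple' }, (res) => {
    let body = '';
    res.on('data', (chunk: Buffer) => (body += chunk));
    res.on('end', () => {
      console.log(JSON.parse(body).price); // prints 42
      priceService.close();
    });
  });
});
```

<p>The answer is the same either way; the difference is the serialization, the socket, and the scheduling in between, which is exactly the throughput trade-off mentioned above.</p>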
In fact, as we see in the example of Kubernetes, container managers turn into operating systems. <br></p><p>Kubernetes, which has become the standard container manager, has package managers and CronJobs. You can install software on Kubernetes clusters and run Docker containers (sort of like EXE files). Kubernetes is pretty much a new OS.<br><br></p><p><strong>Understand how to deliver Docker containers.</strong> Now deploying a simple website takes 5 minutes, and people consider this the norm. You need to build a Docker image, test it, push it to your registry, and roll it out to your container manager (we’ll use Kubernetes as an example). Everyone is used to this process, it can be optimized, and it is the standard. You'll also need to understand CI/CD and GitOps.<br></p><p><strong>Understand that on-premises hosting for most applications is already a thing of the past.</strong> Some time ago, it was OK to buy and assemble servers, bring them to a data center and get them colocated, racked, stacked and connected in a network. Then dedicated servers became popular. These days, hardly anyone wants to actually buy and assemble hardware for small and medium projects.<br></p><p>I have been using AWS since 2008 and, of course, it has its problems. But I don't see why we need to manage Kubernetes and dedicated servers ourselves if someone else can do it for us. (I mean services such as EKS, GKE, etc.) This is also true for databases. </p><blockquote>For most applications that are not designed for very high loads and do not require extensive performance tuning, cloud-based PostgreSQL/MongoDB/MySQL is much better: you don't have to think about tuning or backups. </blockquote><p>You can spin up a copy of a production server with just a couple of commands in a cloud console. Admins, you might feel less than excited about it, but, being an admin myself, I learned from experience that database management is mostly required only for high-load projects. 
Perhaps AWS and GKE services are expensive or even inaccessible for some of us due to legal restrictions, but sooner or later, other similar services will provide the same capabilities, and the paradigm will change.<br></p><p><strong>Understand that Infrastructure as Code is a thing now. </strong>I didn’t like IaC when it was represented by Chef and Puppet. Fortunately, they were replaced by the more suitable Terraform and Pulumi for describing what you want to set up in a cluster and Ansible for working with your infrastructure. Using the shell is faster and more convenient, but it does not fit into the new paradigm.<br></p><hr><h2 id="steps-to-learn-the-modern-stack">Steps to learn the modern stack<br></h2><p>Here is a concrete, hands-on path for learning to work with the modern stack.<br></p><p>1. Create a trial account on any cloud hosting platform. I started with GKE, but you may prefer an account provided by another hosting service. If Terraform/Pulumi support your cloud provider, use them to describe the infrastructure that you want to create. If you have programming skills, I recommend Pulumi: in this case you can use familiar languages and constructs instead of Terraform’s configuration language.<br></p><p>2. Put an application into a Docker container. What application you choose is up to you. For example, I suddenly discovered that NodeJS is very common now and decided to research its uses, so I work with it. Here is a <a href="https://ghost.org/">NodeJS blog platform</a> that you can set up.<br></p><p>3. Understand the basic constructs (pod, deployment, service) of Kubernetes (K8S) and manually deploy your application to your K8S cluster.<br></p><p>4. Understand what Helm is and how to use it, create a Helm chart and deploy your application using Helm. Get the free plan on CircleCI as a CI/CD tool which you do not need to install; its configuration is similar to other systems'.<br></p><p>5. 
Deploy your application using CI. Separate CI (which builds applications) from CD. Handle the CD part with GitOps (e.g., ArgoCD).</p><hr><h2 id="what-s-next">What’s next?<br></h2><p>After going through these steps, you will know the basics of the modern stack.<br>But how else can you pump up your skills?<br></p><p>If you are looking for a job in Europe or North America or want to work there in the future, you can deepen your knowledge of cloud environments by passing the Google Cloud Architect Certification exam or its equivalent from AWS. (<a href="http://www.itsumma.ru/">Our team</a> recently got three such certificates.) As you prepare for certification, you gain a better understanding of cloud features. You can try this training course at <a href="https://linuxacademy.com/">linuxacademy</a>.<br></p><blockquote>Pass the CKA exam, which is tough but worth it. Preparing for this exam will help you to learn a ton about Kubernetes administration.<br></blockquote><p>Learn to program. Personally, I’m learning frontend development. I was surprised how much it has changed since 2012, when I only got as far as jQuery. (Ridiculous by today's standards.) Frontend has become more complicated than backend. The former includes a lot of application logic plus paradigms that are completely unfamiliar to me. It’s very interesting!</p>]]></content:encoded></item><item><title><![CDATA[Resilience Engineering: Lessons from REDeploy 2019]]></title><description><![CDATA[Today, our task is no longer about avoiding errors and accidents in the system. 
The task is to train people and systems to guarantee that future accidents make the least impact on the system, its users and creators.]]></description><link>http://devopsprodigy.com/blog/resilience-engineering/</link><guid isPermaLink="false">5ea250d68de63a0001199730</guid><category><![CDATA[resilience]]></category><category><![CDATA[conference]]></category><category><![CDATA[failures]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Evgeny Potapov]]></dc:creator><pubDate>Wed, 06 May 2020 02:07:15 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/04/23--------.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/04/23--------.jpg" alt="Resilience Engineering: Lessons from REDeploy 2019"><p>Last October, I attended a conference called REDeploy 2019 dedicated to resilience engineering. Old news, you say? Just in time, says I! <em>Today, when almost all conferences and fora have moved online, we appreciate offline meetings more, and value networking like never before.</em> Here, I want to share with you the lessons I learned from one of the most informative events of the last year, introduce you to experienced experts in the field, and share a few thoughts on the most critical quality we all need these days: the ability to deal with failures and move forward regardless of pressure.</p><p>Though the term has been in use for a decade already, resilience engineering still needs a bit of explanation. First, it is important to understand that resilience engineering is a cross-disciplinary field. It pursues research, formalization, and formation of practices that increase the ability of complex socio-technical systems to resist unusual situations, to adapt to accidents, and keep improving adaptability.</p><p>For many years, software development has been seen through a mechanistic lens. We believed that we could develop failure-free software. 
Even if an accident happens, we thought, there will be an identifiable root cause, which we can fix and thereby prevent the recurrence of similar mistakes in the future. We were sure that the number of errors was bounded, so in the end it would be possible to correct all the errors that can cause accidents. For more on that, check out this great article by J. Paul Reed: <a href="https://medium.com/@jpaulreed/dev-ops-and-determinism-966a57e3a5cc">Dev, Ops, and Determinism</a>.</p><p>The same technical approach was applied to how people interact during an accident. Many believed that it was enough to create some toolkit, give it to people, and voila: they solve every problem and don’t make any mistakes.</p><p>In fact, the problem is that software is constantly being updated. It becomes complicated, siloed, and branched. Every accident has its own separate root cause, which might even be outside the system. Finally, people can make mistakes when they communicate with each other about how to resolve the accident.</p><blockquote>Thus, the task is no longer about avoiding errors and accidents in the system. The task is to train people and systems to guarantee that future accidents make the least impact on the system, its users and creators.</blockquote><p>Software development has long stood apart from the “offline” engineering disciplines, which have been using harm reduction practices for decades. To prevent accidents, those practices focus more on people than on tools and technical solutions.</p><p>Resilience engineering focuses on the following questions:</p><ol><li>What cultural and social peculiarities of human interaction should be well understood to better predict what can and cannot occur in communication between people during an accident? How can this process of adaptation and communication be improved? 
And, on the flipside, when and how can the process go wrong?</li><li>What knowledge from other disciplines can we apply to make the system more flexible and resilient in the event of an accident?</li><li>How shall we organize training and human interaction to ensure that, in the event of an accident, we can minimize the damage and the stress from resolving it?</li><li>What technical solutions or practices will help here?</li><li>How can deliberate actions enhance the system's stability and adaptability to accidents?</li></ol><p>Those were the key issues of the October REDeploy 2019 conference. Let's look at some presentations.</p><hr><ol><li><strong>A Few Observations on the Marvelous Resilience of Bone and Resilience Engineering. Richard Cook</strong></li></ol><p>This speaker deserves a separate introduction. Richard Cook is a research scientist, physician, and one of the main communicators of resilience engineering in the IT sphere. Together with David Woods and John Allspaw (the man who actually launched DevOps as a separate field by making Dev and Ops work together), Cook founded Adaptive Capacity Labs, which helps other organizations adopt resilience engineering principles.</p><p>It is important to note that REDeploy is not purely an IT conference, and this presentation proved that. <em>A large part of the presentation was a detailed analysis of how a broken bone gets healed.</em> That healing process was presented as an archetype of resilience. Without external medical help, bones knit incorrectly. Medicine has been learning to heal bones by studying the healing process. 
In fact, medicine does not heal bones, it enacts processes that promote healing.</p><p>In general, the treatment process can be divided into two categories:</p><ul><li>Treatment as a process that creates the most favorable conditions for bone healing (for example, we apply a plaster cast to hold the bone steady);</li><li>Treatment as a process to “improve” bone healing (we understand the biochemical processes involved and apply drugs that speed them up).</li></ul><p>And here is the thing: that presentation uncovered the key aspects for the whole sector. Why do we need to understand the socio-technical processes at work during an accident?</p><blockquote>By understanding how the “treatment” mechanism (e.g., the resolution of an emergency) works, we can, first of all, organize conditions that will minimize damage, and, second of all, speed up the process of incident resolution. We can't make people stop breaking their bones. But we can improve their healing processes.</blockquote><hr><p><strong>2.</strong> <a href="https://speakerdeck.com/adhorn/a-blast-radius-reduction-approach-to-embracing-failure-at-scale"><strong>The Art of Embracing Failure at Scale</strong></a><strong>. Adrian Hornsby</strong></p><p>This presentation is purely about IT and shows the evolution of resilience in the AWS infrastructure. Without getting into technical details (you can check them via the link above), let's see the core thesis of the presentation. </p><blockquote>When building various systems, AWS designs the architecture from the following point of view: an accident is going to happen sooner or later. </blockquote><p>Thus, the system's architecture should be designed to limit the “blast radius” in case of an accident. Customers' databases and data stores, for example, are divided into groups of “cells,” and traffic from one customer only affects the users of that customer's cell. Cell replicas do not duplicate the original cells. 
By being shuffled across cells, they limit the radius of impact to a minimum.</p><figure class="kg-card kg-image-card"><img src="https://lh6.googleusercontent.com/6la56ALVpOnyskZQixDEAmpy_mEc0GJ9i3u09G3pT1TXUouXE4dsdTtzIWql3Vk_-Mme-_17s8wkQDQoCSHdnz_ZBk73EXMjhyVzkFuogSpsMxlL9FYhJeNYmydyLSApPJVTe6ye" class="kg-image" alt="Resilience Engineering: Lessons from REDeploy 2019"></figure><figure class="kg-card kg-image-card"><img src="https://lh5.googleusercontent.com/-EcnH8bluwinrt7d25TbB2T9fQGI_Ol2mqt6dxvc4USgEBcHbxAxr5U8QLkl8P1vVSEwXKQZRijRqm1pQVwMh-0Sx5zzUBnaGKx9VTmKbuz_0c7S4HrSCuVqM8hG-ZlnyGbzNMoN" class="kg-image" alt="Resilience Engineering: Lessons from REDeploy 2019"></figure><figure class="kg-card kg-image-card"><img src="https://lh3.googleusercontent.com/G-yqyIM1qjUZFnmXDA5xa1IBll0L8F9l662wQiEI0lvlG43p5bMQoiKPn-wacCS7hUqz0tT9T6Tsurfvokoil6mhoW5r3eh9Ycu8TTp_VV6Bb5yTrYsDV2D-oivyQaEfy3hlO_ip" class="kg-image" alt="Resilience Engineering: Lessons from REDeploy 2019"></figure><blockquote>By increasing the number of such combinations, we reduce the share of customers affected in case of an accident.</blockquote><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/aIDSpI3-FUBRbFWSleGZKVl04uL6-Oa3vEi1YC52kU9b2INRH0S_JhPwwIrLFA8vpdflwJRvXHagMWuTaFp18XZm36ir0tbpbHo8JJUBjqkw-0tMVXb4NMADW2QxtfNnxY30EyFx" class="kg-image" alt="Resilience Engineering: Lessons from REDeploy 2019"></figure><hr><p><strong>3. Getting Comfortable with Being Underwater. Ronnie Chen</strong></p><p>This was a presentation by a Twitter manager with experience in technical deep-sea diving, who talked about the specifics of diving safety measures. </p><blockquote>Team deep diving is an activity that carries elevated risk. When you organize a deep dive, you can't assume that diving will only happen once every risk is eliminated. If that were the case, there would be no deep diving at all. 
</blockquote><p>Problems may happen one way or another, and that's fine. Chen associates taking chances responsibly with a tool for human development. If we offset the risks, we will limit our potential. The task here, again, is to organize the easiest way to solve problems if and when they materialize.</p><p>How will teams deal with the pressure while performing risky activities? Here are a few rules of diving team engagement:</p><p>- There must be reliable, unbroken communication between team members, and maximum psychological safety must be guaranteed for everyone. The latter means, among other things, that everyone has an opportunity to terminate the dive at any moment (no criticism allowed).</p><p>- Green light for mistakes. Everyone has a right to make mistakes, which are inevitable in the working process. Blaming people for mistakes is unacceptable.</p><p>- The team can redefine the project's objectives and its success during the dive, as conditions change.</p><p>- The team is composed of people with similar stress resistance; the least experienced member guides the team’s actions.</p><p>- One of the most important tasks is to build up each team member's experience. Besides first-hand experience, there is a focus on fail stories. All team members share their stories of failed dives or making mistakes with the whole team, so everyone gains useful experience.</p><p>- Postmortems (those fail stories) are not there to find the root cause, which in most cases does not exist, but to share the experience.</p><hr><p><strong>4.</strong> <strong>The Practice of Practice: Teamwork in Complexity. Matt Davis</strong></p><p>Since engineers act mostly on intuition during an accident, the presenter compared intuition to musical improvisation. 
</p><blockquote>Musical improvisation is an intuitive process of playing music, where intuition is based on previous experience, including knowledge of musical scales, previous improvisations, and teamwork. </blockquote><p>Plus, there is a bi-directional process: experience builds intuition, while analysis of intuitive actions forms the processes (in music, for example, the musical notes of a created composition are written down; in technology, the process of resolving an incident is documented).</p><p>There are two ways to develop and train intuition:</p><p>- Postmortem. Instead of being a means to assign blame or prevent problems in the future, a postmortem should be a tool to accumulate and share experience. Regularly share stories of dealing with accidents to pass on your experience of solving different problems!</p><p>- Chaos Engineering as a way to build experience under controlled conditions. By artificially creating an accident in the system, we allow the engineers to gain intuitive experience. We can determine the stack of technologies where we want to build capabilities, and, at the same time, limit the blast radius for the system.</p>]]></content:encoded></item><item><title><![CDATA[Developing a Plugin for Grafana: Story of Struggles and Successes]]></title><description><![CDATA[A few months ago, we released our new open-source project KubeGraf: a Grafana plugin for monitoring Kubernetes. 
In this article, we want to tell you the story of the plugin’s creation, the tools we used, and the pitfalls we faced in the development process.]]></description><link>http://devopsprodigy.com/blog/developing-a-plugin-for-grafana/</link><guid isPermaLink="false">5e7c30ee0981f500019dc1ab</guid><category><![CDATA[grafana]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[k8s]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Evgeny Potapov]]></dc:creator><pubDate>Fri, 24 Apr 2020 02:34:00 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/03/-----------_-------red2---3-_page-0001.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/03/-----------_-------red2---3-_page-0001.jpg" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"><p>A few months ago, we released our new open-source project: a Grafana plugin for monitoring Kubernetes, which we called <a href="http://devopsprodigy.com/blog/kubegraf-plugin/">DevOpsProdigy KubeGraf</a>. The source code of the plugin is available in a <a href="https://github.com/devopsprodigy/kubegraf">public repository on GitHub</a>. And in this article, we want to tell you the story of the plugin’s creation, the tools we used, and the pitfalls we faced in the development process. So, let's do it!</p><hr><p><strong>Part 0 — Introductory: How Did We Get There?</strong><br>The idea to write our own plugin for Grafana came to us out of the blue. For over 10 years, our company has been monitoring web projects of various complexity. Over that time, we have built up a lot of expertise, gathered many interesting case studies, and gained vast experience with different monitoring systems. At some point, we thought, "Is there a magic tool for monitoring Kubernetes? One that you can just install and forget about?" The industry standard for monitoring K8s is, of course, a combo of Prometheus + Grafana. 
There is a large set of various ready-made solutions for this stack, including <em>prometheus-operator</em>, plus a set of <em>kubernetes-mixin</em> and <em>grafana-kubernetes-app</em> dashboards. We considered <em>grafana-kubernetes-app</em> to be the most interesting option, but it hadn't been supported for more than a year. Besides, it wasn't able to work with new versions of <em>node-exporter</em> and <em>kube-state-metrics</em>. So, we asked ourselves: "Why don't we do it on our own?"</p><p>So, here are the ideas we decided to implement in our plugin:<br>•	visualization of an "app map": a convenient presentation of apps in a cluster, where they are grouped by <em>namespaces, deployments</em>, etc.</p><p>•	visualization of links: "<em>deployment — service (+ports)</em>"</p><p>•	visualization of the distribution of applications across the cluster's nodes</p><p>•	collection of metrics and information from multiple sources: Prometheus and the K8s API server</p><p>•	monitoring of the infrastructure (CPU time, memory, disk subsystem, network) as well as the app's logic (pods' health status, the number of available replicas, information on liveness/readiness tests)</p>
A <em>plugin.json</em> file should also be located in this directory, and the file should contain all meta information about your plugin: name, version, type of plugin, links to the repository/site/license, dependencies, and so on.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/nHQCMkxt0XLs7lrYRw6z9yMzv_QCT2nUMY7BbK4ypjzqSDEJpH-KdpTYtYmIZs7KQ8mytiNSFF8zP5eB-Y_24zijs4nPnqLC4NOPVgRoLsqu-WvfQKnAaIv55wK1rVGpIKF-O4qj" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"><figcaption><strong><em>module.ts</em></strong></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/6TX0MBMuW80T3qDtzwXT4jm7MvbbLfoLUeX-HS_6YmlP6UgOxxEFuC-aBEYOjKDUjot-I8sdpCaKq9YsrOCLPp0wwHHzgSttORo0qAK0q8XmOm-Sk5_Cdwla3sszKyKXwbHzgNWC" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"><figcaption><strong><em>plugin.json</em></strong></figcaption></figure><p>As you can see in the above screenshot, we have specified <em>plugin.type = app</em>. 
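<p>To make this concrete, here is a minimal sketch of a <em>plugin.json</em> for an app-type plugin (the field names follow Grafana's plugin manifest conventions, but the values here are placeholders rather than the actual KubeGraf manifest):</p>

```json
{
  "type": "app",
  "name": "My Monitoring App",
  "id": "my-org-monitoring-app",
  "info": {
    "description": "An example app plugin",
    "author": { "name": "My Org" },
    "version": "1.0.0"
  },
  "dependencies": {
    "grafanaVersion": "5.x.x",
    "plugins": []
  }
}
```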
Plugins for Grafana can be of three types:<br><strong>panel</strong>: The most common type of plugin, this is a panel for visualizing any metrics, and is used to build multiple dashboards.</p><p><strong>datasource</strong>: A plugin connector to any data source (for example, <em>Prometheus-datasource, ClickHouse-datasource, ElasticSearch-datasource</em>).</p><p><strong>app</strong>: A plugin that allows you to build your own frontend application inside Grafana, create your own HTML pages, and manually access the datasource to visualize different data.</p><p>Plugins of other types (<em>datasource</em>, <em>panel</em>) and various dashboards can be used as dependencies.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/itu8_2MGNrlHbcDnIa7MwLvMZxx4YifUxInYMyKrKlI4Td6UgMDCunj8Ea2EBvHfuRfXjdWngS-RbtRr_RsnpuMd9zVY0cHglgkcaf9n3r44DfYSYzrxZi7WxeFDajkGMtVyzQC1" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"><figcaption><strong><em>An example of a dependency of a plugin with type = app.</em></strong></figcaption></figure><p>JavaScript or TypeScript can be used as the programming language (we chose TypeScript for our plugin). You can find templates for hello-world plugins of any type <a href="https://github.com/CorpGlory">here</a>. The repository contains a large number of starter packs (there's even an experimental React-based plugin example) with pre-installed and pre-configured bundlers. </p><hr><p><strong>Part 2: Prepping the Local Environment</strong><br>To work on the plugin you definitely need a Kubernetes cluster with all the pre-installed tools: <em>prometheus, node-exporter, kube-state-metrics, </em>and <em>grafana</em>. The environment should be set up quickly and easily, and Grafana should be mounted directly from the developer's machine to enable a hot reload of the data directory. 
<br></p><p>In our opinion, the most convenient way of working with a local Kubernetes instance is <a href="https://kubernetes.io/docs/tasks/tools/install-minikube/">minikube</a>. Our next step is to install a combo of Prometheus + Grafana with the help of <em>prometheus-operator</em>. <a href="https://medium.com/faun/trying-prometheus-operator-with-helm-minikube-b617a2dccfa3">Here</a> you will find an article describing how to install <em>prometheus-operator</em> on minikube. To enable persistence, you must set the parameter <em>persistence: true</em> in the file <em>charts/grafana/values.yaml</em>, add your own PV and PVC, and specify them in the parameter <em>persistence.existingClaim</em>. <br></p><p>Our final minikube startup script looks like this:<br></p><p>minikube start --kubernetes-version=v1.13.4 --memory=4096 --bootstrapper=kubeadm --extra-config=scheduler.address=0.0.0.0 --extra-config=controller-manager.address=0.0.0.0<br>minikube mount /home/sergeisporyshev/Projects/Grafana:/var/grafana --gid=472 --uid=472 --9p-version=9p2000.L</p><hr><p><strong>Part 3: Direct Development</strong><br><strong>Object model</strong><br>As part of the preparation process for developing the plugin, we decided to describe all the underlying entities of Kubernetes we will work with in the form of TypeScript classes: <em>pod, deployment, daemonset, statefulset, job, cronjob, service, node, </em>and <em>namespace</em>. Each of these classes inherits from the common class <em>BaseModel</em>, which describes a constructor, a destructor, and methods to update and toggle visibility. In each of the classes, nested relationships with other entities are described; for example, a list of pods for the entity of <em>deployment </em>type. <br></p><!--kg-card-begin: html-->import {Pod} from "./pod";
import {Service} from "./service";
import {BaseModel} from './traits/baseModel';

export class Deployment extends BaseModel{
   pods: Array<Pod>;
   services: Array<Service>;

   constructor(data: any){
       super(data);
       this.pods = [];
       this.services = [];
   }
}
<!--kg-card-end: html--><p>With the help of <em>getters</em> and <em>setters</em>, we can display or set the desired metrics of the entities in a convenient and readable form. For example, here is the formatted output of a node's <em>allocatable cpu</em>:</p><!--kg-card-begin: html-->get cpuAllocatableFormatted(){
   // Kubernetes reports CPU either in whole cores (e.g. "4") or in millicores (e.g. "500m").
   let cpu = this.data.status.allocatable.cpu;
   if(cpu.indexOf('m') > -1){
       cpu = parseInt(cpu)/1000; // "500m" -> 0.5 cores
   }
   return cpu;
}
<!--kg-card-end: html--><p><strong>Pages</strong><br>The list of all the pages of our plugin is originally described in our <em>plugin.json</em> under the dependencies section:</p><figure class="kg-card kg-image-card"><img src="https://lh5.googleusercontent.com/Ol4MA4njSsKteVnGaGsIc-EAb4h4E0Hs2tFC7vkRg6Jd2WgV8FIpHfWXTUQuLTr0Yh7iN_wTbiprfYHtm5eOaxCtDVu-VCV0in0s1sTrPWNAStYc7Y8yxo2rpa1DSZrAC_g-GINF" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><p>In the block for each page we should specify the PAGE NAME (it will then be converted to a slug, and that's how the page will be available), the name of the component responsible for the page (the list of components will be exported to <em>module.ts</em>), the user role that can work with the page, and navigation settings for the side panel.</p><p>In the component responsible for the page, we should set a <em>templateUrl</em> containing the path to the HTML file with markup. Inside the controller, using dependency injection, we can get access to two important Angular services:</p><p>•	backendSrv — a service providing interaction with the Grafana API server;</p><p>•	datasourceSrv — a service providing local interaction with all the data sources installed in your Grafana (e.g., <em>.getAll()</em> returns a list of all installed datasources, and <em>.get(&lt;name&gt;)</em> returns an object instance of a specific datasource).</p><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/oFpeePZE8ra_yX0mWf-i6vtAheQziBtEO0TRAYXg9xH4zDtzEjzZXVZHkavD25GeDnRb77D7_gmM3wVq3biaLFXIOuX_Y7QoM5dZEZE0KBKkO60S9mYRsKcczbjQ69ZLWkm7dwDC" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><figure class="kg-card kg-image-card"><img src="https://lh5.googleusercontent.com/LJn_0_OqUaY9_vRDjgAS4zxShIrVCrNI78_1gnMt_H2L88mcTKFkwYRJx3ogEDE72HKRQ_YbAZSnJUJzynyQiiypMUOCbkN1xynw7ICWlUKeUM82PBwTh2I5H7Ewsy4LXg66E-io" class="kg-image" 
alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/oIbLnw8W7zK-RDmNiSMs62iTHdx4oLTcSmi3y0a3yo6pjt4c7yZtVf53QB05Dtf7PAwVEabTL4sP4j9762OBMOxzNtcl7hEwktsKQs7wL6hbCAy06C6hH8NXjALmXFvX6UzPI2ph" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><hr><p><strong>Part 4: Data source</strong><br>In Grafana, a data source is a plugin just like any other: it has its own entry point <em>module.js</em> and a file with metadata called <em>plugin.json</em>. When developing a plugin with <em>type = app</em>, we can communicate both with existing data sources (for example, <em>prometheus-datasource</em>) and with our own data sources, which we can store directly in the plugin directory (<em>dist/datasource/*</em>) or install as a dependency. In our case, the data source is supplied with the plugin code. A data source also requires a <em>config.html</em> template and a <em>ConfigCtrl</em> controller for its configuration page, while the data source controller implements the actual logic of your data source. 
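<p>As a rough, self-contained sketch of that division of labor (all class and field names here are illustrative rather than the actual KubeGraf source; a real controller would also receive Grafana services such as <em>backendSrv</em> via dependency injection):</p>

```typescript
// Hypothetical state behind config.html: the settings a user edits
// on the datasource configuration page.
class ConfigCtrl {
  static templateUrl = 'datasource/config.html';
  current = { url: '', jsonData: { accessViaToken: false } };
}

// Hypothetical datasource controller: implements the actual logic,
// e.g. assembling request URLs against the K8s API server.
class KubeDatasourceSketch {
  url: string;
  accessViaToken: boolean;

  constructor(instanceSettings: { url: string; jsonData: { accessViaToken: boolean } }) {
    this.url = instanceSettings.url;
    this.accessViaToken = instanceSettings.jsonData.accessViaToken;
  }

  // Token-based access is routed through a proxy path, mirroring the
  // '/__proxy' idea that KubeGraf's testDatasource() uses below.
  buildUrl(endpoint: string): string {
    return (this.accessViaToken ? this.url + '/__proxy' : this.url) + endpoint;
  }
}

const ds = new KubeDatasourceSketch({
  url: '/api/datasources/proxy/1',
  jsonData: { accessViaToken: true },
});
console.log(ds.buildUrl('/api/v1/namespaces'));
// prints "/api/datasources/proxy/1/__proxy/api/v1/namespaces"
```

<p>The config controller only holds what the user typed on the settings page; the datasource controller turns those settings into concrete requests.</p>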
<br></p><p>In the KubeGraf plugin, from the point of view of the user interface, the data source is a Kubernetes cluster that has the following features (source code available <a href="https://github.com/devopsprodigy/kubegraf/tree/master/src/datasource">here</a>):</p><p>•	collecting data from the K8s API server (getting the list of <em>namespaces, deployments</em>, etc.)</p><p>•	proxying requests through <em>prometheus-datasource</em> (which is selected in the plugin settings for each specific cluster) and formatting the responses for use both in the static pages and in the dashboards</p><p>•	updating the data in the static pages of the plugin (with a fixed-time refresh rate)</p><p>•	processing requests that populate template variables in <em>grafana-dashboards</em> (the <em>.metricFindQuery()</em> method)</p><figure class="kg-card kg-image-card"><img src="https://lh5.googleusercontent.com/lH72Xdrtj3VkBT3SUthJmj-IBbapX-j9RlEb25MrlnQxGLUqRGATaTmrlW_fdGRPw2r8G2GQ96N2Oyr4SlwhLB9rEDPDe13DsVd5IISZDgliLUcv6g2x0J9I2iQ0SeuO5P5jgK4_" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><figure class="kg-card kg-image-card"><img src="https://lh5.googleusercontent.com/-YZm51CKrzuLHMR0--evoKX7Db_VGXKKKzjs9doDA1isZoxys7FxaJFWV6d-ZVrvFtL0WlnuB6498zYlkGQeA3uV-bzM_IGcUUhSVOzWeavatSn-As8at1CMpxxMvuVh8N1bhP_r" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><figure class="kg-card kg-image-card"><img src="https://lh6.googleusercontent.com/lFqA7yIVUvnlnxyvTLEsRP-nAp3LTNsR9pNhRVXb5HRsBTtVGWXVU3Unnf_b5rnuIzpVUn2GHKWwZBR7AcyW1alQQLfFyORdFPrqtIzEuH5AwXzpzVxsY29FfeSAQjSyxo8KAnEu" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><p>•	performing a connection test with the final K8s cluster:</p><!--kg-card-begin: html-->testDatasource(){
   let url = '/api/v1/namespaces';
   let _url = this.url;
   if(this.accessViaToken)
       _url += '/__proxy';
   _url += url;
    return this.backendSrv.datasourceRequest({
        url: _url,
        method: 'GET',
        headers: {'Content-Type': 'application/json'}
    }).then(response => {
        if (response.status === 200) {
            return {status: 'success', message: 'Data source is OK', title: 'Success'};
        } else {
            return {status: 'error', message: 'Data source is not OK', title: 'Error'};
        }
    }, error => {
        return {status: 'error', message: 'Data source is not OK', title: 'Error'};
    });
}
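
// --- Illustrative aside (not part of the plugin source): the URL assembly in
// testDatasource() above can be read as the following standalone helper.
// "buildTestUrl" is a hypothetical name used only for this sketch. When access
// via a bearer token is enabled, the request goes through the "__proxy" route
// declared in plugin.json, so the Grafana proxy server attaches the
// Authorization header on our behalf.
function buildTestUrl(baseUrl, accessViaToken) {
    let url = baseUrl;
    if (accessViaToken) {
        url += '/__proxy';
    }
    return url + '/api/v1/namespaces';
}
// buildTestUrl('/api/datasources/proxy/1', true)
//   -> '/api/datasources/proxy/1/__proxy/api/v1/namespaces'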
<!--kg-card-end: html--><p>Another interesting thing here, from our point of view, is implementing an authentication and authorization mechanism for the data source. As a rule, we can use a built-in feature of Grafana, <em>datasourceHttpSettings</em>, right out of the box to configure access to the final HTTP data source. We just need to indicate the URL and the basic authentication/authorization settings: login and password, or <em>client-cert/client-key</em>. In order to implement the ability to configure access using a bearer token (the de facto standard for K8s), we had to doctor it up a bit.<br></p><p>To solve this issue, you can use the built-in Grafana tool "Plugin Routes" (read more on the <a href="https://grafana.com/docs/plugins/developing/auth-for-datasources/">official documentation page</a>). In the settings of our datasource, we can declare a set of routing rules to be processed by the Grafana proxy server. For each individual endpoint, you can set headers or rewrite URLs, templating them if you like; the data for that can come from the <em>jsonData</em> and <em>secureJsonData</em> fields (the latter stores passwords or tokens in encrypted form). 
In our example, queries of the type <em>/__proxy/api/v1/namespaces</em> will be proxied at a URL of the type <em>&lt;your_K8s_api_url&gt;/api/v1/namespaces</em> with the header <em>Authorization: Bearer</em>.</p><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/wWtTzSCHWGKOZtCJlWJ67C0B5FUws29dCL88Bd76FbcWyKKKdiyadn_8r8Ztt6o_qNPAU5xFsojPpuxvqt7eO10RIvRu2k3G-pEoFbxVSI55osP4gz0jxC6hce0dnWDRk1XzCiVg" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><figure class="kg-card kg-image-card"><img src="https://lh3.googleusercontent.com/WANTUi-lcaU8mv8x8P_R43ft1aQTICZT7bC8Hzo5OAJad9rNRbfIcVgxM5QvSb1G0tTCJoOBkmCeOWuxbZNzHADRTsXQY0ZcU1dEbNdSpfutvu97ERmz1TOxybuaHi2SkeUC6DCo" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><p>Of course, to work with the K8s API server, we need a user with read only access. You can find access manifests to create one in the <a href="https://github.com/devopsprodigy/kubegraf/tree/master/kubernetes/">plugin's source code</a>.</p><hr><p><strong>Part 5: Release</strong><br>As soon as you write your own plugin for Grafana, you'll definitely want to share it as open source. For that purpose, you can use the Grafana plugin library, available <a href="http://grafana.com/grafana/plugins">here</a>: grafana.com/grafana/plugins. 
<br></p><p>If you want your plugin to be available in the official store, open a PR against <a href="https://github.com/grafana/grafana-plugin-repository">this repository</a>, adding the following content to the <em>repo.json</em> file:</p><figure class="kg-card kg-image-card"><img src="https://lh6.googleusercontent.com/43IOsz36hLNhh708kTc9LvO0hzuzygLtjSDlHDw8AIKnvr1mqkwsvL_uVTHIfr75VdVp3CG1URpt_o8zZXy7qAeU4ggNQZNAoviho7fo3uFhfYqhKDXGZVo2TNfN3_haOFHEXBTw" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><p>where <em>version</em> is the version of your plugin, <em>url</em> is a link to the repository, and <em>commit</em> is the hash of the commit at which that version of the plugin is available.<br></p><p>And finally, you will see a beautiful sight:<br></p><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/6SRXrTc8pVEsa_MHSpgEGA5yj2LebEmRaBGKFVKSnC9Yvm9GTzGCu6nWoA6-glMf0aamSKdOTeE4QUjkG4ttbRPvKUJ-z3PQNs99gxd6FXJ6EPJl5Zb3neXI8BQnJvnIPVyboGeE" class="kg-image" alt="Developing a Plugin for Grafana: Story of Struggles and Successes"></figure><p>The data for this page will be automatically collected from your <em>Readme.md</em>, <em>Changelog.md</em>, and the <em>plugin.json</em> file with the plugin description.</p><hr><p><strong>Part 6: In Conclusion? Not Yet.</strong><br>Even though we’ve released the plugin, we’re definitely not done developing it. What are we doing now? We’re working on perfecting the monitoring of cluster node resource utilization and crafting some new features to improve the UX. Also, we’re carefully processing tons of feedback, which we received after the plugin was installed both by our clients and by those who requested it on GitHub (if you share your issue or pull request there, we'll be super happy :-) ).<br></p><p>We hope this article will help you understand Grafana (it’s such a wonderful tool), and maybe even write your own plugin. 
Thank you!</p>]]></content:encoded></item><item><title><![CDATA[How to Prepare Your Site for Heavy Traffic]]></title><description><![CDATA[First of all, monitor your infrastructure. If you don't know what's happening inside your system and exactly where it's happening, you can't fix it. ]]></description><link>http://devopsprodigy.com/blog/prep-for-heavy-traffic/</link><guid isPermaLink="false">5e7b3f5b0981f500019dc168</guid><category><![CDATA[traffic]]></category><category><![CDATA[scale]]></category><dc:creator><![CDATA[Evgeny Potapov]]></dc:creator><pubDate>Wed, 25 Mar 2020 12:05:54 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2020/03/Prep-for-traffic.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2020/03/Prep-for-traffic.jpg" alt="How to Prepare Your Site for Heavy Traffic"><p><strong>1. Monitor your infrastructure. </strong>First of all, you should know what's happening with your website. If you're experienced with Prometheus/Grafana, you could use them, but if you’re not, it's not a problem; you can use any monitoring service, such as Datadog, and set it up really quickly. If even that is too much, use Pingdom or Site24x7, at least to check that your website is still available. <em>Remember, you can't control what you don't measure, and the most important thing is that if you don't know what's happening inside your system and exactly where it's happening, you can't fix it.<br></em><br>There are multiple possibilities of what could go wrong when you get hit by traffic:</p><ul><li>You're bound by CPU resources</li><li>You're bound by RAM limits</li><li>You're bound by your HDD/storage performance</li><li>You're bound by the bandwidth on your cloud instance/cluster/server<br></li></ul><p><strong>2. Prepare to scale at 60-80% of maximum load. </strong>Whenever you see that you've reached 80% of your resource limits, you should start scaling. 
When you reach 100%, you'll be down, and it will take time to recover (not to mention it will be very stressful). <em>You should act fast, because you’ll be losing your users, and you might make more mistakes when you're in a hurry.</em> When you reach 80% of your load, scale until you get it down to 40%, then repeat as necessary.<br><br><strong>3. Keep an eye on HDD performance and bandwidth limits, not only CPU and RAM. </strong>It's harder to discover the problem when your performance is hit by IOPS (input/output operations per second) or network bandwidth limits.<br></p><p><strong>4. Watch your database performance, especially when you're using a cloud database. </strong>RDS, Cloud SQL, MongoDB Atlas and other services are managed by the cloud, but they have their own limits; you should watch them and scale when necessary.<br></p><p><strong>5. When your DB hits a CPU limit, check your indexes; that might really help.</strong><br>Adding indexes dramatically reduces CPU load. Say you’re using 90% of your DB CPU. You might want to scale the server to 2x CPU to handle 2x load, but if most of your queries are unindexed, adding indexes might reduce your CPU load by 10x, so it’s worth investigating.</p><p><strong>6. Keep an eye on your cloud bills. </strong>It's easy to forget about your bills when you’re in a rush. Set up budget alerts in your billing system. Bandwidth is especially pricey. If you're unable to move your content to a CDN or to dedicated hosting services like <a href="http://100tb.com">100tb.com</a> or Leaseweb, the prices are still high.</p><p><strong>7. Avoid state in your app. </strong>Though it's possible to scale CPU and RAM resources in the cloud, there is still a limit that you can't overcome. At that point, you’ll want to scale horizontally by adding new instances of the same app—but your app should be ready for it. 
When you have multiple instances of the same app, your users' requests are distributed across multiple servers, so you can't store the data on a local disk.</p><p><strong>8. Consider moving to the cloud if you're on a dedicated hosting. </strong>You can’t easily scale when you’re using dedicated hosting; it would take time to add more servers. <em>It could take anywhere from a couple of hours to a couple of days to get new servers available</em>, and usually you pay by the month, not by the hour. You don’t want to wait hours or days if you’re already down. It’s much easier to scale in the cloud.<br><br><strong>9. Tune your infrastructure. </strong>There are some basic things that are disabled by default that you might want to configure in your OS, network layer, app management, and programming language manager; they might reduce your resource usage dramatically. <em>Google for “your-tech-stack tuning” and follow the basic recommendations.</em></p><p><strong>10. Be ready to start a minimal/cached version. </strong>Despite any of your efforts, if you get a 100x spike in traffic, you’ll be down. It takes time to scale up, so be ready to serve a static cached version. <em>You might use Cloudfront/Cloudflare cache for this, or your CDN cache, nginx cache, or anything else.</em> Just make sure that you’re able to do it when you need to.</p>]]></content:encoded></item><item><title><![CDATA[DevOpsProdigy KubeGraf: Revised K8S monitoring in Grafana]]></title><description><![CDATA[DevOpsProdigy KubeGraf is a Grafana plugin that allows you to monitor K8s. 
It’s an updated, advanced version of the official Grafana K8s App plugin.]]></description><link>http://devopsprodigy.com/blog/kubegraf-plugin/</link><guid isPermaLink="false">5e050250c17c220001eede8e</guid><category><![CDATA[k8s]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[grafana]]></category><category><![CDATA[prometheus]]></category><category><![CDATA[plugin]]></category><dc:creator><![CDATA[Sergey Sporyshev]]></dc:creator><pubDate>Thu, 26 Dec 2019 19:07:44 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2019/12/Untitled-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2019/12/Untitled-1.png" alt="DevOpsProdigy KubeGraf: Revised K8S monitoring in Grafana"><p>We’re excited to present our latest in-house solution! <a href="https://grafana.com/grafana/plugins/devopsprodigy-kubegraf-app">DevOpsProdigy KubeGraf</a> is a Grafana plugin that allows you to monitor K8s. It’s an updated, advanced version of the official Grafana K8s App plugin.</p><p>The official plugin lacked a few significant options that we implemented in KubeGraf:<br></p><ul><li>Authentication and authorization via bearer token for working with the K8s API, so the plugin can be installed with <a href="https://github.com/devopsprodigy/kubegraf/tree/master/kubernetes">read-only access</a> on cloud K8s solutions, including Amazon AWS, Google Cloud Platform, DigitalOcean, etc.</li><li>Support for the latest K8s versions, from K8s 1.12 to K8s 1.17</li><li>Support for the latest versions of Node exporter and kube-state-metrics (to create dashboards)</li><li>Monitoring of StatefulSets<br></li></ul><p>Plugin’s key features:</p><p><strong>Cluster Health status:</strong><br>Provides a brief overview of any serious issues happening with your K8s cluster, such as heavy resource usage, pod readiness/liveness, etc.</p><figure class="kg-card kg-image-card"><img 
src="http://devopsprodigy.com/blog/content/images/2019/12/image_2019-12-26_17-51-33.png" class="kg-image" alt="DevOpsProdigy KubeGraf: Revised K8S monitoring in Grafana"></figure><p><strong><strong>Applications Overview:</strong></strong><br>Detailed Service Map describing current information on Deployments/Statefulsets/Daemonsets/Cron Jobs/Jobs/Services and the relations between them, arranged by Deployments and Namespaces in one place.</p><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2019/12/image_2019-12-26_17-52-06-1.png" class="kg-image" alt="DevOpsProdigy KubeGraf: Revised K8S monitoring in Grafana"></figure><p><strong><strong>Nodes Overview:</strong></strong><br>A visual presentation of cluster metrics and cluster map:</p><!--kg-card-begin: markdown--><ul>
<li>A map of pod distribution according to cluster’s nodes, node info and details about cluster node resources</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2019/12/image_2019-12-26_17-52-51.png" class="kg-image" alt="DevOpsProdigy KubeGraf: Revised K8S monitoring in Grafana"></figure><!--kg-card-begin: markdown--><ul>
<li>Node statistics and graphs</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2019/12/image_2019-12-26_17-53-44.png" class="kg-image" alt="DevOpsProdigy KubeGraf: Revised K8S monitoring in Grafana"></figure><!--kg-card-begin: markdown--><ul>
<li>Pod statistics and graphs</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="http://devopsprodigy.com/blog/content/images/2019/12/image_2019-12-26_17-55-47.png" class="kg-image" alt="DevOpsProdigy KubeGraf: Revised K8S monitoring in Grafana"></figure><p>Deployment/Statefulset/Daemonset status with details about available replicas, container status, resources</p><!--kg-card-begin: markdown--><ul>
<li>Node statistics</li>
<li>App health status</li>
</ul>
<!--kg-card-end: markdown--><p><strong>The plugin is available on <a href="https://grafana.com/grafana/plugins/devopsprodigy-kubegraf-app ">Grafana</a> and on <a href=" https://github.com/devopsprodigy/kubegraf">GitHub</a>.</strong><br></p><p><strong><a href=" https://join.slack.com/t/devopsprodigygroup/shared_invite/enQtODM0Nzc2NjkwNzkwLTgwMGUwYzFiMDU1N2Y2OWM2NjdiYTc2YjU2NDFmYjQ1NDY5YzM1OGYwMDRjOWZmNDYxOTMxODYzZjc0Mjg3MDc">Join us on Slack</a>!</strong></p>]]></content:encoded></item><item><title><![CDATA[Who the hack is FinOps? Or proven tips to cut infrastructure costs]]></title><description><![CDATA[<p><em>You’ve probably wondered before how much the infrastructure of your project costs you, and you’ve definitely noticed there’s no linear relation between the cost and the workload. Many CEOs, CTOs, and developers suddenly realize that they spend way too much on infrastructure. But what exactly costs so</em></p>]]></description><link>http://devopsprodigy.com/blog/who-the-hack-is-finops/</link><guid isPermaLink="false">5d821e4b3dd33d0001f68933</guid><category><![CDATA[finops]]></category><dc:creator><![CDATA[Anton Baranov]]></dc:creator><pubDate>Wed, 18 Sep 2019 12:42:14 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2019/09/257906681_91402--1-.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2019/09/257906681_91402--1-.jpg" alt="Who the hack is FinOps? Or proven tips to cut infrastructure costs"><p><em>You’ve probably wondered before how much the infrastructure of your project costs you, and you’ve definitely noticed there’s no linear relation between the cost and the workload. Many CEOs, CTOs, and developers suddenly realize that they spend way too much on infrastructure. But what exactly costs so much?</em></p><p>Usually, cost minimization ends up with buying the cheapest solution or AWS support plan, or optimizing the hardware configuration when it comes to physical racks. 
Plus, there’s no clear-cut rule to determine who might end up handling that task. In a startup, it would probably be the lead programmer, who already has a whole slew of headaches. In larger or more established companies, for example, the CMO, CTO, or even the CEO in tandem with an accountant would have to deal with it. In short, it’s not unusual to see infrastructure costs skyrocket while the situation is being handled by busy, confused staff.</p><p><em>When you need paper towels for the office, your office manager or outsourced cleaning service personnel will take care of it. When you need programming services, call the CTO. When you need to boost sales...well, you get the point. However, since way back, when a "server room" was not a room, but a closet with a tower holding a couple of hard drives in a RAID, everyone (ok, most of us) has turned a blind eye to the fact that there is no specially educated person who knows all the ins and outs of power capacities.</em></p><!--kg-card-begin: markdown--><p>Unfortunately, my experience suggests that this task has been constantly delegated to random people: first noticed, first charged. <mark>It's only recently that the FinOps position began to take some concrete shape in the industry.</mark> FinOps is a specially trained person tasked to control the purchase and exploitation of power capacities, and, therefore, revision of a company's expenditures on infrastructure.</p>
<!--kg-card-end: markdown--><p>I’m not saying you should get rid of expensive and effective solutions. At the end of the day, every company must decide for itself what hardware and/or cloud-based systems it needs to function reliably. But you must admit that mindlessly writing checks for hardware support without further monitoring and cost-benefit analysis is not a sustainable solution.</p><h3 id="who-the-hack-is-finops">Who the hack is FinOps?</h3><p>Let's say you have a solid company, breathlessly referred to as an "enterprise" by your salespeople. In the beginning, you probably followed the standard procedure of buying dozens of servers, AWS, and some extra bits and bobs here and there. In a big company, there is constantly something going on — some teams grow, others break up or move to different projects. Thus, having extra facilities makes sense. But the mix of these two factors will eventually make your accountants tear out their hair when they see yet another invoice for infrastructure.</p><p>What’s the best way out for the accountant? Buy a wig, pretend to be Moby, or dig into the source of that long string of zeros on all those invoices?</p><!--kg-card-begin: markdown--><p>Let's be honest: the approval and payment process within most companies is far from perfect. However, <mark>though an unattended rack in a server room will eventually be noticed by a watchful system administrator, there’s no such guarantee for a cloud-based service with automatically recurring payments.</mark> Such a service can quickly outgrow its usefulness while the charges steadily rack up. But the saddest thing is that the team next door might be going into a tailspin because an accountant keeps rejecting their request for that same cloud-based system.</p>
<!--kg-card-end: markdown--><p>What's the most obvious solution? Let the needy teams do the management? No way! Relations between different project teams are awfully tenuous in most tech companies. One team might simply not know that the second team has a ton of value in mothballs.</p><p>Who’s to blame? — Well, no one. That's the way the cookie crumbles!</p><p>Who suffers? Well, the whole company.</p><p>Who can fix it? FinOps, ta-da!</p><p>FinOps is not just a third wheel between the developers and the necessary equipment. It is a person or a team that knows what, where, and how well the company’s data is stored. In fact, these people must work in tandem with devops, on the one hand, and the bean counters, on the other, by fulfilling the role of an efficient mediator and, most importantly, analyst.</p><h3 id="a-few-words-on-refinement">A few words on refinement</h3><p>Cloud solutions. They are relatively cheap and very convenient. (They stop being cheap, however, when the number of servers hits three digits.) Plus, cloud solutions allow you to use more and more services that were previously unavailable, including database as a service (Amazon AWS, Azure Database), serverless applications (AWS Lambda, Functions Azure), and many others. They are all very cool in principle: pay, install, and enjoy! The deeper the company merges into the clouds, however, the worse are the CFO's heebie-jeebies. (The faster the CEO loses her hair, too.)</p><!--kg-card-begin: markdown--><p>The invoices for various cloud services are always quite messy: transactions with one expenditure item sometimes require multi-page transcripts. Unfortunately, they don't help you puzzle it out. <mark>There are even services that translate cloud invoice gibberish into human language</mark>;  see, for example, <a href="http://cloudyn.com">cloudyn.com</a> or <a href="http://cloudability.com">cloudability.com</a>.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>So, what should a FinOps do?</p>
<ul>
<li>Clearly understand what kind and how many cloud solutions were purchased, and when</li>
<li>Know how purchased capacities are used</li>
<li>Re-distribute them according to the needs of departments or teams</li>
<li>Safeguard against buying capacities just for &quot;being on the safe side&quot;</li>
<li>And, as a result, save your company money</li>
</ul>
<!--kg-card-end: markdown--><p>Let's look at the example of storing a cold backup DB in the cloud. Say you archive it in order to reduce the space and traffic needed for a repository update. One such operation is a dime a dozen, but hundreds of them cost a pretty penny.</p><!--kg-card-begin: markdown--><p>Or, let's look at another situation: let’s say you purchased reserve capacity on AWS or Azure to prevent failing under peak traffic. Can you be sure that’s the optimal solution? Usually, those backup servers are idle 80% of the time. That means you’re just giving your money to Amazon. Considering that the same AWS and Azure provide burstable instances for such scenarios, <mark>why should you pay for idle servers if you can use a tool that’s specifically provided as a peak traffic solution?</mark> Or, replace your ‘On-Demand’ instances with ‘Reserved’ instances — they cost much less.</p>
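<p>To put some rough numbers on the idle-capacity argument above, here is a back-of-the-envelope sketch. The hourly rates and the 20% duty cycle are hypothetical, purely for illustration — substitute your provider's real prices:</p>

```javascript
// Back-of-the-envelope comparison (hypothetical prices, for illustration only):
// reserve capacity that runs 24/7 vs. extra instances started only for peaks.
const HOURS_PER_MONTH = 730;

function monthlyCost(hourlyRate, hoursRunning) {
    return hourlyRate * hoursRunning;
}

// A standby server billed around the clock, even though it is idle ~80% of the time:
const alwaysOnStandby = monthlyCost(0.10, HOURS_PER_MONTH);

// The same peak capacity started only when needed (~20% of the month),
// even at a higher hourly rate:
const peakOnly = monthlyCost(0.15, HOURS_PER_MONTH * 0.2);

console.log(alwaysOnStandby.toFixed(2)); // 73.00
console.log(peakOnly.toFixed(2));        // 21.90
```

<p>Even at a higher hourly rate, capacity that runs only during peaks comes out far cheaper than a standby server billed around the clock — which is exactly the kind of arithmetic a FinOps person should be running.</p>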
<!--kg-card-end: markdown--><h3 id="a-few-words-about-cash">A few words about cash</h3><p>As I said in the beginning, procurement tasks are often delegated to some random people, who are later left to themselves. Usually, those people end up deciding how much and what to purchase on a tight schedule and without advice from top managers.</p><!--kg-card-begin: markdown--><p>Meanwhile, <mark>the company could have saved tons of money on discounts and bulk purchase reductions if only those people had spent more time, stepped out of their comfort zone, and had consultations with the cloud service salesperson.</mark> Instead of automatic selection of options on the Amazon website, a one-on-one talk with a real sales manager must take place. So, your people need to have the time, authority, and skills to scout and negotiate the best offers.</p>
<!--kg-card-end: markdown--><p>Don't forget that AWS and Azure are not the only pebbles on the beach; there are solutions from other providers. Google, for example, introduced the Firebase platform, where you can locate a project that requires fast scaling on a turnkey basis. This solution makes sure repositories, real-time DB, hosting, and cloud-based data synchronization are available in one place.</p><p>On the other hand, if you have a combination of monolith projects, a centralized solution might not be an option. A long-lived project with its own history of development and a considerable amount of data requiring a repository deserves fractionary accommodation.</p><p>When optimizing spending on cloud services, you may suddenly realize that critical applications are worthy of a more hefty expenditure that will ensure your company's revenue continues to roll in. Developers' "heritage," old archives, and databases, on the other hand, don't need costly cloud solutions. A standard data center with an average HDD and medium hardware with no bells and whistles will be more than enough.</p><!--kg-card-begin: markdown--><p><mark>If you think this &quot;trifle&quot; doesn't deserve your attention, please remember that all big problems begin when people in charge ignore little things and do the job the fastest and easiest way they know.</mark> Then, in a few years, multi-page invoices will start to arrive, and the hair-tearing will commence.</p>
<!--kg-card-end: markdown--><h3 id="instead-of-a-summary-">Instead of a summary...</h3><p>In short, cloud solutions are cool in their ability to meet the needs of businesses of vastly different scale. Unfortunately, we still haven’t perfected a culture of consumption or control over such an innovation. FinOps represents an organizational lever arm that can help you better use cloud capacities. But please don't turn this position into a firing squad that will be tasked to trace careless developers and rap their fingers for idle capacities.</p><p><em>Developers should develop, not count the company's money. It is FinOps who need to make the process of cloud capacities purchasing, disposal, and transfer a simple and enjoyable time for all teams involved.</em></p>]]></content:encoded></item><item><title><![CDATA[Failover in Kubernetes: It Does Exist!]]></title><description><![CDATA[<p><em>On the one hand, monitoring and failover management are the pillars of any project's availability. On the other hand, you might question if failover management in Kubernetes is needed at all. Indeed, everything seems to be self-balanced, self-scaled, and self-restored in Kubernetes. The system looks like a magical fairy that</em></p>]]></description><link>http://devopsprodigy.com/blog/failover-in-kubernetes/</link><guid isPermaLink="false">5d5f9b543dd33d0001f68896</guid><category><![CDATA[k8s]]></category><dc:creator><![CDATA[Sergey Sporyshev]]></dc:creator><pubDate>Fri, 23 Aug 2019 08:06:28 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2019/08/-----2.png" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2019/08/-----2.png" alt="Failover in Kubernetes: It Does Exist!"><p><em>On the one hand, monitoring and failover management are the pillars of any project's availability. On the other hand, you might question if failover management in Kubernetes is needed at all. 
Indeed, everything seems to be self-balanced, self-scaled, and self-restored in Kubernetes. The system looks like a magical fairy that saves you from all infrastructure problems and guarantees your project never fails. So, instead of asking "when" and "how," most of you would ask "why build a failover cluster in K8s?" Unfortunately, our fairytale, like any other, turns into stark reality with the chime of bells.</em></p><p>At DevOpsProdigy, I spend a great deal of time consulting different teams about the pros and cons of various DevOps solutions. Most of them center around Kubernetes, and I want to share a few thoughts about how DevOpsProdigy ensures high availability in Kubernetes. Don't take these as strict guidelines, just as a few insights from past mistakes.</p><!--kg-card-begin: markdown--><p>In the good old days of dedicated servers and bare-metal solutions, with identical virtual or hardware servers, we used to apply three basic approaches:</p>
<ul>
<li>synchronize the code and static files</li>
<li>synchronize configs</li>
<li>replicate data stores</li>
</ul>
<!--kg-card-end: markdown--><p>And there we go. We can switch to a failover replica whenever we need to! Everyone is happy, Cinderella can go to the ball!</p><p>What traditional options are there to secure high availability for our K8s application? First of all, the instructions tell us to install many machines and create many master replicas. Every master should have etcd, API, MC, and scheduler enabled, and their number should be enough to reach a consensus within the cluster. In that case, our cluster will rebalance and keep working perfectly even if several replicas or masters fail. Magic is in the air again!</p><!--kg-card-begin: markdown--><p><mark>But what if our cluster is located in a single data center? Imagine a mechanical shovel cuts our cable, a lightning bolt strikes the data center, or climate change brings on a second Genesis flood? We are doomed and our cluster vanishes. What magic tricks will our fairy have for such scenarios?</mark></p>
<!--kg-card-end: markdown--><p>First and foremost, keep a second, failover cluster so that you can switch to it at any moment. The infrastructures of both clusters must be identical: all non-standard file system plugins, custom ingress solutions, etc. will be carbon-copied across two or more clusters, depending on your budget and DevOps capacities. It is important, however, to clearly define two sets of all applications — deployments, statefulsets, daemonsets, cronjobs, etc. — and specify which of them will work continuously in backup mode, and which will be deployed only after you switch to a failover cluster.</p><!--kg-card-begin: markdown--><p><strong>So, here is the question: will our failover cluster be identical to our production cluster?</strong> No, it will not. I think our previous routine of copying everything for monolithic projects and hardware infrastructure does not work in K8s. A one-size-fits-all approach won’t work, and here’s why.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>Let's start with the basic K8s entities.</p>
<ul>
<li>Deployments will be identical. Applications that can receive incoming traffic will be running all the time.</li>
<li>As for the config files, we will also decide on a case-by-case basis. It’s better not to keep a database in K8s, so access to the working DB goes into configmaps (the failover process for the working DB, by the way, will be handled separately). Accordingly, we need a separate configmap that provides access to the failover DB.</li>
<li>The same applies to secrets, i.e., passwords for DB access and API keys. Either the production or the failover secret must be active at any one time. So here we have two K8s entities for which failover copies are maintained in parallel with the production copies.</li>
<li>Cronjobs are the third entity. <strong>The set of cronjobs on failover must never mirror the set of production cronjobs!</strong> Let's look at an example. If we deploy a failover cluster with all cronjobs enabled, our clients will receive, for instance, two email notifications instead of one. In short, any synchronization with external sources will run twice. No one wants that, right?</li>
</ul>
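As a sketch of the cronjob point above, one option is to ship the same CronJob manifest to both clusters but keep it suspended on failover. All names and schedules here are invented for the example:

```yaml
# email-digest.cronjob.yaml (hypothetical example)
apiVersion: batch/v1beta1          # batch/v1 on K8s 1.21 and later
kind: CronJob
metadata:
  name: email-digest
spec:
  schedule: "0 8 * * *"
  suspend: true                    # true in the failover cluster, false in production
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: email-digest
            image: registry.example.com/email-digest:latest
          restartPolicy: OnFailure
```

After switching over, `kubectl patch cronjob email-digest -p '{"spec":{"suspend":false}}'` turns the job back on, so clients still get exactly one notification.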
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><mark>How does the Internet recommend organizing a failover cluster? The second most popular answer to &quot;how to do a failover cluster in K8s&quot; is to use Kubernetes Federation.</mark></p>
<!--kg-card-end: markdown--><p>What is Kubernetes Federation? Let's call it a big meta-cluster. Remember the K8s architecture with its master and several nodes? Well, every node in Kubernetes Federation is a separate cluster. Just as in K8s, in K8s Federation we work with the same entities and primitives, but we juggle separate clusters, not machines. K8s Federation allows you to sync resources across multiple clusters. You can be sure that every deployment in the K8s Federation will exist in every cluster. Plus, the Federation allows you to customize resources wherever necessary. If we change a deployed configmap or secret in one cluster, the other clusters will stay unaffected.</p><p>K8s Federation is a pretty young tool which doesn't support the entire set of K8s resources. When the first version of the documentation saw the light, it claimed to support only configmaps, ReplicaSets, deployments, and ingress, excluding secrets and volumes. It is indeed a very limited set of resources, especially if you like to have fun and pass your own resources to K8s via custom resource definitions, for example. <strong>On the bright side, K8s Federation provides flexibility in managing our ReplicaSets</strong>. If, for example, we want to run ten replicas of our application, K8s Federation will divide this number proportionally among the clusters by default. And, the good news is that we can still configure all of them! You can specify that your production cluster needs to run six replicas of the application, with the remaining four replicas on the failover cluster to save resources or to experiment. Although this is quite convenient, we still have to learn a new tool, adjust deployments, etc.</p><!--kg-card-begin: markdown--><p><mark>Is it possible to approach the failover process in K8s more easily somehow? What helpful tools do we have?</mark></p>
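Before moving on, here is roughly what the six-to-four replica split described above looks like in Federation v1; this is only a sketch: the cluster names are invented, and later federation projects expose the same idea through different APIs.

```yaml
# hypothetical federated ReplicaSet fragment (Federation v1 annotation)
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: myapp
  annotations:
    federation.kubernetes.io/replica-set-preferences: |
      {
        "rebalance": true,
        "clusters": {
          "production": {"minReplicas": 6},
          "failover":   {"minReplicas": 4}
        }
      }
spec:
  replicas: 10
```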
<!--kg-card-end: markdown--><p>First, we always have some kind of CI/CD system that generates yamls for our containers so we don’t have to create/apply them manually on our servers.</p><p>Second, there are several clusters, as well as a few (if we're smart enough) registries we have backed up too. Not to mention a wonderful kubectl that can work with multiple clusters simultaneously.</p><p><strong>So, in my opinion, the simplest and smartest decision for creating failover clusters is a primitive parallel deployment. </strong>If there is a pipeline in the CI/CD system, we first build containers, then test them, then roll out applications via kubectl to several independent clusters. We can run the rollout on several clusters in parallel. Accordingly, at this stage, we can also set deployment configurations: one set of configurations for the production cluster and another for the failover cluster, so that the CI/CD system rolls out a production environment to the production cluster and a failover environment to the failover cluster. Unlike in K8s Federation, we don't need to check and re-define resources in separate clusters after each deployment. It has been done already. We can be proud of ourselves.</p><p>There are, however, two serious concerns. The file system is the first one. Usually, we have either a persistent volume or an external repository. If we store our files in a PV inside the cluster, we had better use good old rsync or our preferred way to sync the files. Roll it out to all machines and prosper!</p><p>The second obstacle is our database. Again, good guys do not keep their database in K8s. Data failover, in that case, is organized the usual way: master-slave replication followed by promoting a new master. Finally, verify that the replica is up and running, and go dancing!
If, however, we keep our DB within the cluster, there are many ready-made solutions for deploying a DB inside K8s with the same master-slave replication.</p><p>There are gazillions of presentations, posts, and books about DB failover, and I can add nothing new here. Just one piece of advice: follow your dreams, develop your own complicated hacks, but please, please think through your failover scenarios.</p><!--kg-card-begin: markdown--><p><mark>Now, let's dig into the very process of switching to the failover site in case of the Apocalypse.</mark></p>
<!--kg-card-end: markdown--><p>First of all, we deploy our stateless applications to both clusters simultaneously. They do not affect the business logic of our applications or our project, so we can keep two sets of applications running at all times, and they can start balancing the traffic.</p><!--kg-card-begin: markdown--><p>Second of all, we decide how to configure our slave replicas. Let's imagine we have a production and a failover cluster in K8s, plus an external master database and a failover master database. There are three potential scenarios for how these components can fail in production.</p>
<ol>
<li>The DB may fail, and we will have to switch traffic from the production DB to the failover DB.</li>
<li>Our cluster may fail, and we will have to switch to the failover cluster while continuing to work with the production DB.</li>
<li>Finally, both the production cluster and the production DB may go down; then we switch to the failover cluster and the failover DB, and redefine our configs so that our applications work with the new DB.</li>
</ol>
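In scenario 1, for instance, the switch can boil down to repointing a single selector-less Service at the failover address; a minimal sketch, assuming the DB runs outside the cluster (names and IPs are invented):

```yaml
# main-db.yaml (hypothetical): a Service without a selector
# plus a manually managed Endpoints object with the same name
apiVersion: v1
kind: Service
metadata:
  name: main-db
spec:
  ports:
  - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: main-db
subsets:
- addresses:
  - ip: 10.0.0.10        # production DB; replace with the failover DB IP on switchover
  ports:
  - port: 5432
```

Re-applying this file with the failover IP redirects every application at once, without redeploying anything.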
<!--kg-card-end: markdown--><p><strong>What conclusions can we draw from all this?</strong></p><p>First and foremost, to live with a failover setup is to live happily. It's expensive, though. Ideally, there should be more than one failover; in a perfect world, a few failovers will be enough. One failover should be located in one data center, and the other with a different hosting provider. Believe me — I found it out the hard way. Once there was a fire in a data center and I suggested switching to the failover. Unfortunately, the failover servers were located in the very same spot.</p><p>The second and last conclusion: if your application in K8s connects to external sources (a database or some external API), you should define them as services with external Endpoints. In that case, when you switch databases, you won't have to redeploy the dozens of applications that reference that same database. <strong>Define your database as a separate service and use it as if it's inside the cluster. If the DB fails, you will only need to change the IP in one place, and continue to live long and prosper.</strong></p>]]></content:encoded></item><item><title><![CDATA[Database Containerization in Kubernetes]]></title><description><![CDATA[<p><em>Historically, any issue can split the tech industry into two camps: "for" and "against." Moreover, the matter of dispute can be completely arbitrary. Which OS is better, Win or Linux? Which is best for smartphones, Android or iOS? 
Should you store everything in the cloud or keep it in cold</em></p>]]></description><link>http://devopsprodigy.com/blog/database-containerization-in-kubernetes/</link><guid isPermaLink="false">5d417ad53dd33d0001f68831</guid><category><![CDATA[k8s]]></category><dc:creator><![CDATA[Sergey Sporyshev]]></dc:creator><pubDate>Wed, 31 Jul 2019 11:36:40 GMT</pubDate><media:content url="http://devopsprodigy.com/blog/content/images/2019/07/8-2-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://devopsprodigy.com/blog/content/images/2019/07/8-2-1.png" alt="Database Containerization in Kubernetes"><p><em>Historically, any issue can split the tech industry into two camps: "for" and "against." Moreover, the matter of dispute can be completely arbitrary. Which OS is better, Win or Linux? Which is best for smartphones, Android or iOS? Should you store everything in the cloud or keep it in cold RAID storage? Can PHP experts call themselves developers? Most of the time, these disputes are purely philosophical in nature and don't have any empirical basis, giving rise to a lot of hair splitting.</em></p><p>It’s not surprising that with the advent of containers, Docker, K8s, and the like, there have been lots of arguments for and against using new inventions in various areas of the backend. (Note that in this article Kubernetes serves as a generalized orchestrator. You can substitute it with the name of any other orchestrator that you find the most comfortable and familiar.)</p><p>It could have been another simple dispute about two sides of the same coin. For example, it could have been the same senseless and merciless confrontation as the one between Win and Linux, in which most normal people stand somewhere in the middle. Unfortunately, it's not that simple. In the dispute over whether or not to store databases (DB) in container systems, there is no right side. 
Because, in a certain sense, both supporters and opponents of this approach are right.</p><h2 id="the-pros">The Pros</h2><!--kg-card-begin: markdown--><p>Let's look at the pros. Say you have a major web project. It might initially be based on a microservices approach, or it might at some point turn out that way — it's not very important, really. <mark>You spread your project over several microservices and set up orchestration, traffic balancing, and scaling, and then you think it's time to sip mojitos in a hammock instead of having to recover failed servers. Not so fast!</mark> Very often, the application code is the only thing that’s containerized. But what else is there besides the code?</p>
<!--kg-card-end: markdown--><p>Bingo! It's data. Any project's core is its data. It can exist as a typical DBMS (MySQL, PostgreSQL, and MongoDB), a storage used for search (ElasticSearch), a key-value store (Redis), and so on. Now, let's omit poorly implemented backends where queries can crash the DB, and instead talk about DB fault tolerance under clients' traffic. After all, when we containerize our app and allow it to scale freely to process any number of incoming requests, the load on our DB increases.</p><p>In fact, the request channel to our DB and the server where our DB is stored are like the eye of a needle that leads into our beautiful containerized backend. And don't forget that the main reason to containerize is to make the structure mobile and flexible, which, in turn, will allow you to organize load balancing across the available infrastructure as efficiently as possible. So, if we don't containerize all the elements of the system, including the DB, across the cluster, we’re making a very serious mistake.</p><p>It makes more sense to clusterize not only the app itself, but also the services responsible for data storage. If, let's say, we prepare our web servers for clustering by spreading them across different tables and databases in one monolithic DBMS, we immediately solve the problem of data synchronization (for example, comments on posts). Either way, we obtain an intra-cluster (albeit virtual) view of the DB, like an ExternalName Service. The DB itself, however, is not yet in the cluster. In fact, the web servers we deployed in K8s pull data from our static production database, which operates separately.</p><!--kg-card-begin: markdown--><p><mark>See the catch-22? We use K8s or Swarm to balance the load and protect our primary web servers from failure, yet we don't do the same for the DB. What use are empty web pages that return access errors from a failed database?</mark></p>
<!--kg-card-end: markdown--><p>That is precisely why we need to clusterize not only web servers, but also DB infrastructure. It's the only way we can create a structure whose elements can work in concert, yet independently of one another. Even if half of our backend fails due to high traffic, the rest will survive. Plus, a DB synchronization system within the cluster and the opportunity to scale and deploy new clusters without limit will help to achieve the required capacities. The only limit is the number of racks in the data center.</p><p>In addition, a clusterized DB is portable. In the case of a global service, it's quite illogical to locate your web cluster somewhere in San Francisco and move data back and forth to New York for every DB request.</p><p>Also, DB clustering allows you to build all system elements at one level of abstraction. That, in turn, makes it possible for devs to control the system directly with code and without the active involvement of the ops team. Want to create a separate DBMS for a new sub-project? Piece of cake! Write a yaml file, load it into the cluster, and voila!</p><p>And, of course, the internal administration becomes significantly easier. How many times have you winced when new colleagues thrust their greasy fingers into your production DB? The one and only production database! Of course, we're all adults here, and you probably have several backups here and there and maybe in cold storage, because you’ve seen a DB apocalypse before. But still, every new team member with access to the production infrastructure and database is a bucket of valium for all team leaders. It's scary, right?</p><p>Containerization and geographical distribution of your project DB help you avoid such terrifying moments. Newcomers are not trustworthy? Okay! 
Let's give them a separate cluster to work on, unplug it from the rest of the DB, and sync clusters only with a manual push and the simultaneous turning of two keys (one by the teamlead, another by the system administrator). Everyone's happy!</p><p>Now, let's play devil’s advocate and reveal all the disadvantages of DB containerization.</p><h2 id="the-cons">The Cons</h2><p>To discuss why we should not containerize the DB but should keep running it on central server replicas, let's not sink into standard arguments of “that’s the way it is.” Instead, let’s think about when containerization would really bring tangible benefits.</p><!--kg-card-begin: markdown--><p><mark>Realistically speaking, the number of projects that really need to containerize the DB can be counted on one hand.</mark></p>
<!--kg-card-end: markdown--><p>In most cases, the very use of K8s or Docker Swarm tends to be redundant. Quite often these tools are used due to the widespread hype over clouds and containers. Most people think it’s cool.</p><p>Again, using K8s or Docker for a project is usually above and beyond what's needed. Sometimes DevOps teams or outsourced specialists don't pay attention to that fact. Sometimes — and this is way worse — DevOps teams are compelled to use containers.</p><p>Many people think that the Docker/K8s clique is simply moving in on teams that prefer to outsource the resolution of infrastructure issues. In fact, working with clusters requires engineers who understand the architecture of the implemented solution and know how to operate it. At DevOpsProdigy, we once taught a client — mass media platform Republic — to work with K8s. They were happy, we were happy. It was honest. Most often, however, K8s promoters take the client's infrastructure hostage: while they know the ins and outs of the container system, the client's team doesn't know beans about it.</p><p>Now, let's imagine that an outsourced DevOps engineer receives access not only to the web server, but also to DB maintenance. Remember, the DB is the core of any project, and losing it will be fatal. The prospects are far from positive. So, instead of giving in to K8s hype, most teams had better use a good AWS package, which will solve the load balancing problems of their site or project. Here, I expect somebody to respond that AWS is no longer cool enough... Well, there are show-offs everywhere, including the tech industry.</p><p>Perhaps clustering is indeed necessary for some projects. While there will be no concerns about stateless applications in that case, clustering the DB and then organizing decent network connectivity for it raises a lot of questions.</p><p>Even a seamless engineering solution like K8s will still cause some headaches, for example, 
data replication in a clusterized DB. Some DBMSs are quite amenable to distributing data among their separate instances; many others are not that friendly. So, quite often, the ability to replicate with minimal resource and engineering costs was not the main argument when the DBMS was chosen for the project, especially if it was not originally planned as a microservice architecture project but simply turned out that way.</p><p>And, speaking of virtual file systems: unfortunately, we can't call Docker volumes problem-free. In general, reliable long-term data storage calls for the simplest technical schemes. Adding a new abstraction layer between the container file system and the host file system is risky enough as it is. When, however, there are also problems with transmitting data between these layers, it's big trouble indeed. The more complex the process, the easier it breaks.</p><p>Considering all these issues, it's much more beneficial and easier to keep the DB in one place, even if you need to containerize your app. Let the app run on its own and connect to the DB, which will be read and written in one place, via a distribution gateway. Such an approach reduces the risk of errors and desynchronization to naught.</p><p>To sum up, DB containerization is appropriate only where there is a real need for it.</p><h2 id="conclusion">Conclusion</h2><!--kg-card-begin: markdown--><p>If you’re looking for a black-and-white conclusion about whether to virtualize your DB or not, I regret to say there is none. <mark>When creating any infrastructure solution, you should follow common sense, not hype or tech innovations.</mark></p>
<!--kg-card-end: markdown--><p>There are projects that perfectly incorporate K8s principles and tools; such projects find harmony, at least in the backend. There are also projects that need normal server infrastructure, not containerization. The reason is that they can't be re-scaled to the microservice cluster model — they will simply fail if forced into it.</p>]]></content:encoded></item></channel></rss>