Over the last few years, we’ve taken every possible opportunity to talk about what DevOps is. It may seem like it’s getting tedious, but the fact that the DevOps discussion is still going on is quite telling: there are still unresolved issues. And these issues lie in communication between businesses and DevOps engineers.
I often see how people coming to DevOps from different backgrounds each have their own definition of DevOps and basically speak different languages. At a certain point, it turns out that the stakeholders of a DevOps transformation project don't understand each other, and don’t even understand why they need DevOps at all.
In this article, I'm not going to talk about what DevOps is and what's the right way to define it. Instead, I'll focus on the evolution of IT processes, what businesses want to gain from implementing DevOps, what that entails for DevOps engineers, and how we can bridge the gap between us.
Looking back at the journey we’ve gone through with our customers, I can see how business requirements have changed over the course of the last few years. We have been providing maintenance services for complex information systems since 2008, and at first our customers mostly wanted us to make their websites fault-tolerant. But now they have some radically different requests. Back then, the most important things were stability, scalability, and a resilient production environment. Now, fault tolerance of development platforms and deployment systems is of equal importance.
Large companies have experienced a value shift: a stable development environment has become just as important as a stable production environment.
To better understand how we got here and how our mentality has changed, let’s briefly go over the history of software development.
Some history of software development
I think the evolution of software development principles can be roughly divided into four stages. An important thing to note is that software delivery has existed at all stages, although who was responsible for it has varied throughout its evolution.
Mainframes: 1960s - 1980s
Characteristic features of this stage: From the point when computers had just appeared and up to about 1985, companies developed software exclusively for internal use. Small development teams delivered software with rather limited functionality, which was tailored to the demands of the specific company or government-related organization where it would be used. That limited functionality might have been intended for sending people to the Moon, but compared to modern services, it didn't have many use cases.
Users: The employees of the company where the software was developed. Back then, the number of software users was also very limited — for example, three Apollo astronauts, 20 people who calculated a government budget, or 100 people who processed population census results.
Software delivery: Via physical media and mainframes. Companies had to produce punched cards, then put them into a computer, and in about 10 minutes the program was ready for use. Software delivery was technically the responsibility of the person who entered the data onto punched cards. If a developer made a mistake, fixing it took a lot of time because that required rewriting and debugging the code, producing new punched cards, and inserting them into machines. All that took days and meant that many people's time had simply been wasted. So mistakes had very negative consequences and sometimes could even result in a disaster.
At this stage, IT as a business didn’t really exist yet. Wikipedia lists only four software development companies founded in 1975. One of them was Microsoft, but back then it was a very small and niche company.
PCs and OOP: 1980s - 1990s
Things started to change approximately in 1985, when personal computers became quite common: Apple Computer, Inc. started manufacturing the Apple II in 1977, the IBM PC was released in 1981, and a bit earlier, DEC minicomputers had gained significant popularity.
Characteristic features of this stage: Software development was turning into a business. The number of users had grown, and that made creating software for sale possible.
In 1979, for example, the first spreadsheet software, called VisiCalc, was introduced. It took on some calculation tasks previously performed by an accountant (this role has now been transferred to Excel). Before that, an accountant entered numbers into a big table on paper and performed calculations using different formulae. If an analyst asked what would change if the revenue in the third quarter were twice as high, the accountant had to change one value and perform the same calculation again — all on paper.
Users: Other companies. VisiCalc completely transformed the computer industry. Now software was developed for the mass market instead of a specific group of users with specialized requirements. For example, economists and analysts started to buy computers in order to leverage electronic spreadsheets.
Because there were more potential users, and software could be sold to individuals as well as companies, developers had to figure out how to make their software work for a large user base and how to create such complex software in general.
The growing number of users made it necessary to expand functionality. That required expanding development teams as well — a dozen developers wasn't enough anymore. Working on a complex software product required a 100- to 500-person team.
Interestingly enough, each stage has some key books that caused revolutionary paradigm shifts in IT. I think for that stage — when software development as a business began to take hold and development teams started growing in number — those books were The Mythical Man-Month: Essays on Software Engineering and Design Patterns: Elements of Reusable Object-Oriented Software. At this time, two things became clear. Firstly, if you increase the number of developers in a team by four times, it doesn't mean you'll get the result four times faster. Secondly, there are other possible solutions to the scaling problem.
A popular way to deal with the growing complexity of software was object-oriented programming. The idea was that if you took a large application, such as Microsoft Excel, and split it into separate objects, development teams could work on them independently of each other. By dividing the product into parts based on functional elements, you could scale and, as a result, accelerate the overall development of the product. Keep in mind that back then, accelerating the development cycle usually meant reducing it to several years.
The reasoning behind OOP sounds a lot like the reasoning behind microservices. However, at that stage, we still packaged applications into a single file (an .exe file during the reign of Microsoft DOS and, later, Windows) and then delivered it to the user.
Software delivery: Via physical media. At that stage, when software started to be mass-produced, the delivery process consisted of writing your software to floppy disks, labeling them with stickers showing the software name, packing the floppy disks into boxes, and sending them to users in different countries. Also, the number of defective floppy disks needed to be kept to a minimum. After all, if we manufacture floppy disks in America, deliver them to Russia, and only then find out that half of them are defective, that means huge losses for the business, and our customers will leave us once and for all.
The cost of a mistake: Customers would demand their money back and would never buy from the same company again, which might ruin the whole business.
The software development cycle was terribly long, because each stage lasted many months or even years:
● planning — 12 months
● development — 24 months
● testing — 12 months
● delivery — 12 months
New software versions were released once every few years, so making mistakes in the code was unacceptable.
The main risk factor, however, was that you couldn't get any user feedback throughout the whole software development cycle.
Just imagine: we've got an idea for an IT product, so we do some tests and decide that users might like our product. But we can't really make sure that it will succeed! We can only write code for two years, then ask some nearby accountants (for example, in Redmond) to install our software (for example, a new version of Excel) and try it out. And that's all we can do to test our product. It might very well turn out that nobody wants it and we've wasted the whole two years.
Or it might turn out that people buy our software, but it’s buggy and doesn't work properly — and because at this stage applications are still physical products that come in boxes which you can bring back to the store, users can easily return our product and decide never to buy any software from us again.
Agile: 2001 - 2008
The next stage came with the adoption of the Internet by the masses in the 2000s.
Characteristic features of this stage: IT businesses were moving to the Internet, but browsers couldn't do much yet.
Microsoft created Internet Explorer, which was provided to all Windows users for free. A huge number of people could now access the Internet. Nevertheless, Microsoft deliberately didn't optimize Internet Explorer for dynamic functionality in order to protect its software from competition, such as browser-based apps and Netscape (you can learn more about that by reading about the browser wars). So the Internet was mostly used for downloading files, but that was enough to make businesses move there.
Software delivery: Users could now get software distributions from the Internet.
This made it possible to release updates and new software versions much more frequently — once every few months. Companies didn't have to write software to floppy disks or CDs anymore, because users could download updates from the Internet, and developers could allow themselves to make more mistakes.
The cost of a mistake: The risk for the business was not that high because users could install an update and keep using the software.
Agile emerged at about the same time, so this stage saw the release of certain books on agile software development and extreme programming that are still considered IT management 101: for example, Extreme Programming Explained: Embrace Change, as well as Refactoring: Improving the Design of Existing Code and Test-Driven Development.
The main idea was that, because companies could now deliver software via the Internet, they could shorten the development cycle and release new versions once every six months.
The software development cycle in the beginning of the 2000s looked somewhat like this:
● planning — 2 months
● development — 6–12 months
● testing — 1–3 months
● delivery — a few weeks
For one thing, rigorous testing wasn't as important as before. Even if 10 percent of users could encounter bugs, it was easier to release a patch rather than spend a year on making sure that the software worked properly for absolutely everyone. This way, companies could also test their hypotheses faster (although in this case faster meant 6–12 months).
Moreover, by spending less on testing and thorough planning, companies could cut costs on these experiments. And experimenting became a key idea of the next stage.
DevOps: 2009 - 2020
Characteristic features of this stage: Installing software is a thing of the past, and any software that needs to be installed is updated via the Internet. The Internet is everywhere. Social networks and entertainment apps that are accessed exclusively through the Internet are gaining popularity. We can now implement complex dynamic functionality that runs in a browser, so businesses take advantage of this opportunity.
Software delivery: Via the cloud. In the previous stage, software was installed on a user's computer, so it had to be adjusted to that environment. Now, we can adjust it to a single computer — our server in the cloud. This is very convenient for us because we have full control over this computer and how our apps run on it. There might be some difficulties with rendering interfaces in a browser, but they aren't too much of a problem anymore in comparison to the issues of the past.
All of that helps us accelerate planning, implementation, and testing. Now, we don't have to be left in the dark for months or even years when it comes to knowing whether our project will make it or not, what functionality users want, and so on. Updating software is possible in almost real time.
Still, in 2006 - 2008, software was developed using the same ideology — an application was regarded as a single entity. While it wasn't an .exe file anymore, it was still closer to a monolith that consisted of several closely connected objects. Such software was too unwieldy to be quickly adapted to the changing market.
In order to solve this problem, the same people who brought us OOP suggested splitting applications as well, so that software would be made up of separate apps that communicated with each other. Then it would be possible to expand development teams even more, going from hundreds to thousands of team members, and create new functionality continuously. This would let companies experiment more, test hypotheses, adapt to the market requirements and the behavior of the competition, and keep growing their businesses.
In 2009, the world saw the first presentation on uniting Dev and Ops in order to deploy code 80 times a day. This unity became one of the main values in software development. The development cycle looks completely different now:
● planning — a few weeks
● development — a few weeks
● testing — a few days
● delivery — a few minutes
We can almost immediately fix mistakes and quickly develop new hypotheses. This was also the stage where MVP (minimum viable product), now a well-known term, was introduced.
While in the 1970s developers had almost no room for mistakes, and software was practically immutable (you don't really need to change requirements every time you send astronauts to the Moon), now it is absolutely dynamic. Everyone expects to find bugs in their software, so we must provide IT support and have a team that makes sure the system works properly regardless of the dynamic changes within it.
In this new stage, for the first time in the history of IT, the software delivery role has truly become an IT job.
Before that, the person responsible for software delivery wasn't considered an IT employee. From the 1970s to the 1980s, this job consisted of simply inserting punched cards into computers, while from the 1980s to the 1990s it was about negotiating with CD manufacturers and taking care of logistics. All of that had nothing to do with software development or system administration.
Those who aren't very familiar with DevOps often think that "DevOps engineer" is just a hip new name for an administrator who’s more involved with the developers. But in business (and in Wikipedia too), DevOps is a methodology that is applied in software development. However, it's not the definition that's the most important — it's what DevOps gives us. And DevOps as a methodology lets us adapt to the changing market ASAP and restructure the way we develop software.
If a business doesn't want to lag behind its competitors, it must get rid of long development cycles with monthly releases and adopt DevOps instead. DevOps transformation here means a complete shift to Agile, from development to deployment. And this is how software delivery becomes a part of the software development process and turns into an IT job.
Because software delivery is connected to handling servers and infrastructure, it seems that this job is better suited for someone with administrator experience. But in this case, we end up with communication problems between DevOps engineers and the business. This is especially true if we're talking about administrators who take part in the DevOps transformation and try to meet the needs of the business, which wants to be more flexible.
Most administrators responsible for fault tolerance have a mantra of "If it works, don't touch anything!" Although their company doesn't launch any rockets, they approach stability with the same mentality as if it did. But in the new, dynamic environment of today's world, businesses want (regardless of potential malfunctions):
● to go from an idea to a deployed product in a minimum amount of time;
● to test a maximum number of hypotheses in a short time;
● to minimize the impact of errors on production.
Even if something crashes, it's not a problem — we can roll back, fix the cause of the problem, and deploy again. It's better to quickly evaluate our product's chances for success than to invest in something that won't be in demand.
The approach to fault tolerance is changing: we no longer need the current version of our software to remain stable for a long time — we just need to reduce the impact that any errors in the current version might have on the performance of the whole system.
Instead of making sure that every little bit of added code is stable, we should be able to quickly discard unstable code and go back to stable code. This, too, is about flexibility: the value is not in the stability of the code deployed in the infrastructure but in the capability of the infrastructure to be extremely flexible.
What exactly can DevOps engineers do to help businesses?
So, how can we better connect with a business and its values?
Since we're implementing DevOps to meet the needs of the business, we must know whether DevOps engineers do what the business needs them to do. In order to do that, we can implement the following metrics (taken from DORA's State of DevOps Report):
● Deployment frequency — how often you deploy code to your production environment or how often your end users get new releases of your product.
● Lead time for changes — how much time passes from committing code to the repository to deploying it in the production environment.
● Time to restore service — how long your service takes to recover from a failure or crash.
● Change failure rate — what percentage of deployments result in worse user experience and require fixing new issues, for example by performing rollbacks.
These metrics will help you evaluate how efficiently your company leverages DevOps. More than that, DevOps engineers can use them to understand what steps to take in order to help the business.
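How you collect these numbers depends on your tooling, but the calculations themselves are simple. Here is a minimal sketch in Python, assuming hypothetical deployment and incident records; in real life you would pull this data from your CI/CD system, Git provider, or incident tracker:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: (commit_time, deploy_time, failed)
deployments = [
    (datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 14, 30), False),
    (datetime(2023, 5, 2, 9, 15), datetime(2023, 5, 2, 11, 0), True),
    (datetime(2023, 5, 3, 16, 40), datetime(2023, 5, 4, 10, 5), False),
]

# Hypothetical incident records: (detected_at, restored_at)
incidents = [
    (datetime(2023, 5, 2, 11, 10), datetime(2023, 5, 2, 11, 55)),
]

period_days = 7  # length of the observed period

# Deployment frequency: deployments per day over the observed period
deployment_frequency = len(deployments) / period_days

# Lead time for changes: average time from commit to production deploy
lead_times = [deploy - commit for commit, deploy, _ in deployments]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change failure rate: share of deployments that degraded the service
change_failure_rate = sum(1 for _, _, failed in deployments if failed) / len(deployments)

# Time to restore service: average duration of incidents
restore_times = [restored - detected for detected, restored in incidents]
avg_restore_time = sum(restore_times, timedelta()) / len(restore_times)

print(f"Deployment frequency: {deployment_frequency:.2f} per day")
print(f"Lead time for changes: {avg_lead_time}")
print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Time to restore service: {avg_restore_time}")
```

Even a rough script like this, run regularly against real data, makes trends visible and gives the business and the DevOps engineers a shared vocabulary for the conversation.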
Deployment frequency
Obviously, the more often a company deploys code, the closer it is to embracing the DevOps transformation. But frequent deployments can be scary, and a good DevOps engineer can help the business overcome these fears.
Fear #1: We might deploy code that hasn't been tested properly, and then our production environment will crash under the load.
The job of the DevOps engineer: Provide an easy way to roll back and help automate testing in the infrastructure.
Fear #2: We might deploy new functionality that has bugs in it, and implementing it will change the data structure or the data itself so much that a rollback won't be possible.
The job of the DevOps engineer: Cooperate with developers — help them with architectural decisions, suggest effective data migration methods, and so on.
Fear #3: Deployments are complex and take a lot of time. (Note: Our experience tells us that Docker images that take 20 minutes to build are quite a common occurrence.)
The job of the DevOps engineer: Find a way to make deployments and rollbacks fast and speed up the build process.
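What does "an easy way to roll back" look like in practice? The details depend on the platform, but here is a minimal sketch addressing Fear #1, assuming a Kubernetes deployment and a hypothetical health-check URL (the deployment name and endpoint below are made up): after a deploy, we probe the service for a while and, if it keeps failing, revert to the previous revision.

```python
import subprocess
import time
import urllib.request

DEPLOYMENT = "myapp"                               # hypothetical deployment name
HEALTH_URL = "https://myapp.example.com/healthz"   # hypothetical health endpoint
CHECKS = 10                                        # number of post-deploy probes
INTERVAL = 30                                      # seconds between probes

def is_healthy(url: str) -> bool:
    """Return True if the service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

failures = 0
for _ in range(CHECKS):
    if not is_healthy(HEALTH_URL):
        failures += 1
    time.sleep(INTERVAL)

# If most probes failed, revert to the previous ReplicaSet revision.
if failures > CHECKS // 2:
    print(f"{failures}/{CHECKS} health checks failed, rolling back {DEPLOYMENT}")
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{DEPLOYMENT}"],
        check=True,
    )
else:
    print("Deployment looks healthy")
```

In a real setup, the same idea is usually handled by readiness probes, canary or blue-green deployments, or a dedicated delivery tool, but the principle is the same: a failed release should be a minor event that reverses itself, not an emergency.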
Lead time for changes
This metric is useful for managers as well — after all, it's a manager's job to organize a workflow where code written by developers is committed and deployed ASAP. But DevOps engineers can help with solving challenges in the organization of such a workflow.
Problem #1: Too much time passes between creating and merging pull requests, for example because not only the pull requests themselves are reviewed, but the submitted reviews are reviewed as well. The root of the problem here lies, again, in the hesitation to deploy code.
The job of the DevOps engineer: Together with the development manager, consider automatically merging pull requests.
Problem #2: Manual testing takes too long.
The job of the DevOps engineer: Help automate testing.
Problem #3: The build process takes too long.
The job of the DevOps engineer: Monitor how much time the build process takes and try to reduce it.
In order to do that, the DevOps engineer must, first of all, understand how software testing works, how to automate it, and how to integrate automated testing into the build process. Second, the DevOps engineer should break down the software deployment pipeline into its individual components and try to figure out which parts can be optimized in terms of speed. You can monitor the whole process, from committing code to deploying it in production, find out how long the build process takes and how long it takes for a pull request to be approved, and then, together with the manager and the developers, figure out where you can save time.
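As an illustration of that breakdown, here is a small sketch assuming hypothetical timestamps for the stages of a single change (normally you would pull them from your Git provider's and CI system's APIs). It reports how long each step took and which one dominates the total lead time:

```python
from datetime import datetime

# Hypothetical timestamps for one change, from commit to production.
stages = [
    ("commit pushed",    datetime(2023, 5, 10, 9, 0)),
    ("PR opened",        datetime(2023, 5, 10, 9, 10)),
    ("PR approved",      datetime(2023, 5, 11, 15, 0)),
    ("build finished",   datetime(2023, 5, 11, 15, 40)),
    ("tests finished",   datetime(2023, 5, 11, 17, 5)),
    ("deployed to prod", datetime(2023, 5, 12, 10, 0)),
]

total = stages[-1][1] - stages[0][1]
print(f"Total lead time: {total}")

# Duration of each step and its share of the total lead time
for (prev_name, prev_ts), (name, ts) in zip(stages, stages[1:]):
    step = ts - prev_ts
    share = step / total
    print(f"{prev_name} -> {name}: {step} ({share:.0%})")
```

In this made-up example, most of the lead time is spent waiting for the pull request to be approved, which immediately tells you where to start optimizing, together with the manager and the developers.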
Time to restore service
This metric actually has more to do with SRE.
Problem #1: Locating technical issues is very difficult.
The job of the DevOps engineer: Ensure observability and, together with the developers, set up a monitoring infrastructure and configure the monitoring system to effectively inform you about the performance of your service.
Problem #2: Currently, the infrastructure doesn't allow for easy rollbacks.
The job of the DevOps engineer: Make the necessary changes to the infrastructure.
Problem #3: A data migration has made rollbacks impossible.
The job of the DevOps engineer: Teach developers best practices for fault tolerance as well as for data migration that enables easy rollbacks.
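On the observability side, much of the work is about making the service report its own health. As a minimal sketch of Problem #1, assuming the application is a Python service and that the prometheus_client library is available, the developers and the DevOps engineer could expose basic request metrics for the monitoring system to scrape and alert on (the metric names and the simulated handler below are illustrative):

```python
import random
import time

# prometheus_client is installed with: pip install prometheus-client
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests", ["status"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

def handle_request():
    """Simulated request handler that records latency and response status."""
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))          # pretend to do some work
        status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)   # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```

Once metrics like these exist, alerting rules (for example, on the share of 5xx responses) tell you within minutes that something is wrong instead of you finding out from users, and that directly shortens the time to restore service.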
Change failure rate
This metric is also from the domain of management. However, here's a fun fact: failures happen more often if deployments are infrequent.
Unfortunately, I often see how companies decide to start a DevOps transformation and implement Kubernetes and GitOps, but all that doesn't have any impact on their release frequency. Their approach stays the same, so if developing a new version of their product took six months before, it still takes six months now. And when you're writing code that takes months to reach the production environment, such code is much more likely to fail than code that's deployed weekly. This mentality undermines the whole DevOps transformation — if a company wants to adopt DevOps but their development cycle takes six months, that's a big problem.
In this situation, the DevOps engineer must sound the alarm and try to explain to the business, once again, what DevOps is about and how the approach to software development has changed over the last few years.
Turning expectations into reality
DevOps engineers need to have a clear understanding of what the business needs and work on fulfilling these needs. Here’s what you should keep in mind:
● The stability of the current version isn't as important as, firstly, the stability of the infrastructure in general and, secondly, the ability to roll back to the previous version in case of a failure, isolate the issue, and fix it quickly.
● The stability and efficiency of the development environment are critically important, especially if there are hundreds of developers on the team. When developers have to stop working because of issues with the development environment, it's just as bad as downtime in a factory.
● Monitoring the software delivery process is now a part of monitoring the whole infrastructure. If something takes 20 minutes to deploy, try to accelerate it.
● Software delivery speed has become one of the key areas for improvement — ideally, you should have a highly efficient pipeline that works without a hitch.
● A convenient development environment is another key objective. If developers can use the environment without any troubles, they write code faster, deploy it more often, and the quality of the code is better overall.