Cloud-based infrastructure and microservices application implementation have significantly changed the nature of every enterprises’ software applications over the past five years.
Cloud-based infrastructure and microservices application implementation have significantly changed the nature of every enterprises’ software applications over the past five years.
These changes have enabled many new and innovative application features to be implemented that provide a more engaging user experience, especially for mobile applications. Mobile devices – cell phones, tablets, and even laptop PCs – have revolutionized how services and goods can be ordered, how content is distributed and consumed, and how people connect and interact around the world.
The introduction of 5G wireless networks has added more fuel to the mobile fire, making the performance of mobile networks comparable to or even better than that of wire-based broadband communications.
All of these changes have increased software and infrastructure health challenges exponentially. As microservices become more granular, the number of connections between services increases dramatically, and it becomes more complicated to detect performance and stability issues.
Microservices-based software complexity is rapidly stretching DevOps and SRE resources beyond reasonable levels, and they’re struggling to keep up with the software repair and support demands. This situation will continue to get worse, especially because microservices enable new software applications to be built rapidly, increasing the number of issues for new applications along with those that occur in existing software. A better means of ensuring software health is needed to keep up with this exponential growth.
Many practitioners are looking to artificial intelligence and machine learning, especially in the form of AIOps, as a path to delivering better software health.
AIOps can’t automatically fix all software issues and usher in the new era of autonomic systems. However, it can provide a means to automate very repetitive tasks or deal with issues that a machine can handle better, faster. It can also, through learning, provide more well-directed solutions for problems, acting in the capacity of an intelligent assistant. But, to date, AI/ML isn’t creative, which means it can only do some things, not all things.
AIOps leverages AI and machine learning to enable some new capabilities for software issue remediation. It provides a means to resolve software performance and reliability issues that have previously been referred to manual triage to resolve.
In the rest of this article I will define the different types of software issues, before outlining some existing and emerging methods for repairing these issues.
The two types of software issues
There are two primary software issue categories:
1) Operational issues
2) Functional issues
Operational issues are when the application components are working properly and as expected, but issues are caused by infrastructure issues that impact application performance. This could be a lack of resources, such as CPU, memory, storage, or network bandwidth, especially during application scaling. These are the type of issues that are the greatest concern for System Reliability Engineers (SREs).
Functional issues are almost always code issues that impact one or more application components. They can occur in one component or cascade among multiple components along the application transaction path. Functional issues typically always require manual triage efforts to resolve the code issues. These are the issues most frequently addressed by DevOps teams and Developers.
Existing and emerging methods for repairing software issues
Observability plus AI/AIOps is at the forefront of next-generation software health technology. With the expanded use of AI and AIOps, the new forms of software remediation are automatic or semi-automatic prevention and repair:
1. Automated incident prevention
Automated incident prevention consists of remediating application resource issues in the underlying application infrastructure. The measure for determining how fast a system can act to prevent an application resource issue is Mean Time to Repair (MTTP).
MTTP requires the fastest and precise observability metrics possible to ensure real-time AIOps remediation. If a performance degradation is detected by the observability platform, an automated incident prevention action would implement procedures to resolve the problem.
2. Runbook procedures
A runbook is a compilation of procedures and operations that are carried out to either automatically, semi-automatically, or manually triage issues that occur in systems. Typically, a runbook contains procedures to begin, stop, supervise, and debug a system or software.
With runbook automation, these processes can be carried out in a predetermined manner. In addition to automating specific processes, such as resource management and optimization, the runbook results can be presented back to the user for further action. Multiple runbooks can also be linked together with machine learning to provide interactive troubleshooting and guided or automated procedures.
3. Automated machine learning-driven code-completion tools
A new and emerging software repair and development capability is ML-powered coding completion tools. These tools accelerate software implementation by providing automatic code recommendations based on the code and comments you input within and IDE. They’re capable of generating entire functions and logical code blocks without the developer having to search for and customize sample code snippets.
Code completion tools can be applied for both testing and development shift-left pre-production initiatives to help speed the code implementation process. Code completion automation provides the code suggestions, but the final decisions about and integration of the suggestions are up to the practitioner making code completion a semi-automated process.
4. Manual repair
Of course, no discussion of software issue remediation is complete without referencing the traditional software issue remediation: manual repair.
Manual code repair has been the gold standard method for software repair and resiliency since the inception of programming languages. It has led to the most common software remediation term – Mean Time to Repair. MTTR is a maintenance metric that measures the average time required to troubleshoot and repair failed software.
Code triage methods that use MTTR as a metric are:
- Corrective maintenance:
Reactive software product modification performed after deployment to correct discovered problems. - Adaptive maintenance:
Software product modification performed after deployment to keep a software product usable in a changed or changing environment. - Perfective maintenance:
Software product modification after deployment to improve performance or maintainability. - Preventive maintenance:
Software product modification after deployment to detect and correct latent faults in the software product before they become effective faults.
Comparing automated, semi-automated, and manual repair methods
Take a look at the below image. Here you can see the benefits that automation and AIOps have to offer, allowing for quicker prevention and remediation of issues.
Whichever method you’re using, remember that the most important measure for rapid issue remediation is precise real-time metrics. All the repair types listed in the Remediation Spectrum are triggered by those metrics; the faster they’re measured, the more rapidly any remediation operation can begin. And they’re particularly critical for any of the automated AIOps procedures because users will be experiencing delays or disruptions for their application experience.
Reflections
AIOps isn’t a magic wand you can wave to automatically fix all issues with your software, but it does present an opportunity to decrease the time spent manually triaging issues, sometimes dramatically. Get ahead of the trend by trying out some of these AIOps repair methods.