Extending a Ticket Analyzer for AI-based Alert Management in Various IoT Domains
For quite some time now, data scientists and machine learning and deep learning (ML/DL) experts have been using various techniques to automate the triaging and analysis of tickets in hi-tech sectors. Extracting meaning and context from bug reports or tickets enables partial automation of the workflow in the defect management process. When this automation is integrated with an automated digital workflow tool, defect management can be automated end to end, with significant gains in productivity and accuracy. Apart from automation, an AI/ML-based ticket analyzer addresses the following problems of a manual process.
- Effort, experience requirement, and bias: Significant effort is spent on allocating bugs found during testing to a development team or a support center. For a large development project, a team of triaging engineers who understand the underlying system characteristics and the secret sauce of the architecture is deployed. It has been found that even with a trained and experienced team, bias creates delay and inaccuracy in allocation.
- Assignment of severity: While priority is process and context sensitive, the severity of a bug depends on the design and the system requirements. Judging severity takes experienced engineers, and a greenhorn needs some time before being able to assign the correct severity to bugs and allocate them to the appropriate developer.
- Removal of noise and “not a bug”: Significant effort goes into determining whether a reported bug is really a bug or a feature, in other words, whether it is a false positive.
- Root cause for a set of bugs: Sometimes a bug or an issue manifests itself in different ways and in different places in the software. Identifying the root cause can save a lot of effort during the solution process. In a manual system, that process depends on the experience level of the triaging engineer.
Today, some well-known, commercially available workflow management products use some form of AI-based tooling for categorization and appropriate assignment.
Although ticket analyzers started with the development process of large software projects and the automation of infrastructure maintenance, they have good potential, when applied appropriately, to automate many other operator-managed processes in various IoT domains. For example:
- Alerts and logs of an IaaS application in a data center
- Alerts of a SCADA or building management system
- Alerts of an IoT-ized smart city command center
- Alert-based retrofitted predictive maintenance module in an IoT-ized system
- Patch identification and management based on the defect classification
These processes can also be automated by a similar scheme. The knowledge base of the aforementioned systems, which an experienced human operator learns and uses to identify, categorize, and troubleshoot problems, can be encapsulated in a properly trained AI system similar to the one in a ticket analyzer. The essential components of this type of AI system are:
The basic idea of the system in Fig 2 is to divide the alerts/bugs/logs into multiple categories and automatically direct the categorized alerts to appropriate attendants/processes.
Historical Data: Properly labeled historical data for the appropriate categories should be prepared from the raw data. This is an important step that must be crafted carefully around the requirements of the workflow. A large amount of data is necessary for the system to be accurate.
Text Analytics: This is the most important step in the entire AI system. First, topics are identified for categorization. Various natural language toolkits can be used to build the models of topics and identify topics that will be used for classification in the later steps. In some cases, a rules engine can also be used for topic generation.
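The rules-engine option mentioned above can be illustrated with a minimal sketch. The topic names, the keyword patterns, and the function `assign_topic` are all hypothetical and chosen only for illustration; a real deployment would derive its rules from its own alert vocabulary or from a topic model.

```python
import re

# Hypothetical keyword rules: topic name -> regex patterns that suggest it.
TOPIC_RULES = {
    "network": [r"\btimeout\b", r"\bpacket loss\b", r"\blink down\b"],
    "storage": [r"\bdisk\b", r"\biops\b", r"\bvolume\b"],
    "power": [r"\bvoltage\b", r"\bups\b", r"\bbattery\b"],
}


def assign_topic(text, rules=TOPIC_RULES, default="unclassified"):
    """Return the topic whose patterns match the text most often."""
    lowered = text.lower()
    # Count pattern hits per topic.
    scores = {
        topic: sum(len(re.findall(pattern, lowered)) for pattern in patterns)
        for topic, patterns in rules.items()
    }
    # On a tie, max() keeps the topic that appears first in the rules dict.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

For example, `assign_topic("Disk latency high on volume vol-7")` matches two storage patterns and no others, so it returns `"storage"`. Text that matches no rule falls back to `"unclassified"`, which can then be sent to a human operator.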
Model Development: Next, an ML or a DL model can be trained to classify text according to the aforementioned topics. In the recent past, RNNs with an LSTM layer have performed very well for a reasonable number of topics and level of complexity. The trained model can then be deployed for online classification.
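To make the train-then-deploy flow concrete without a deep learning framework, the sketch below swaps in a much simpler technique: a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. It is not the RNN/LSTM model described above, and the class name, the whitespace tokenization, and the four training sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per topic
        self.word_counts = defaultdict(Counter)  # word frequencies per topic
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled history, far smaller than any real training set.
model = NaiveBayesTextClassifier().fit(
    ["disk full on volume", "disk latency high",
     "link down on switch", "packet loss on link"],
    ["storage", "storage", "network", "network"],
)
```

Once fitted, `model.predict("disk latency on volume")` returns `"storage"`. The same `fit` and `predict` interface could wrap a recurrent model when the number of topics and the amount of labeled data justify it.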
Classification and Clustering: In actual deployment, the aforementioned models run as part of the alert management pipeline. Whenever alerts are generated, the text associated with them is processed through the model and classified into topics. Clustering is used for root cause analysis of a group of alerts generated ahead of a specific problem.
Though the core of an alert management system is text analytics, which is very similar to a ticket analyzer, most alerts also carry variables and their associated values. Their quantitative nature enables a few useful use cases to be implemented in an alert management system.
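As a rough illustration of the clustering step, the sketch below groups alert texts by word overlap (Jaccard similarity) using a greedy single pass. The function names, the 0.5 threshold, and the choice of Jaccard similarity are assumptions; the article does not prescribe a clustering algorithm, and a production system would more likely cluster on learned text embeddings.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_alerts(alerts, threshold=0.5):
    """Greedy clustering: each alert joins the first cluster whose
    representative (its first member) is similar enough, otherwise
    it starts a new cluster."""
    clusters = []  # each entry: {"tokens": set, "members": list}
    for alert in alerts:
        alert_tokens = tokens(alert)
        for cluster in clusters:
            if jaccard(alert_tokens, cluster["tokens"]) >= threshold:
                cluster["members"].append(alert)
                break
        else:
            clusters.append({"tokens": alert_tokens, "members": [alert]})
    return [cluster["members"] for cluster in clusters]
```

Given the alerts `"pump 3 pressure low"`, `"pump 3 pressure critical"`, `"door sensor offline"`, and `"pump 3 pressure low again"`, the three pump alerts land in one cluster and the door alert in another. A cluster that grows quickly before an outage is a natural candidate for a shared root cause.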
Variables in Alerts: Unlike tickets, alerts contain one or more variables with their values. When the value of a variable in an alert is below a threshold, the alert is simply system noise. Sometimes this noise is determined by a group of variables. An alert management system can identify the cause and the severity of a problem from those variables and inform the attendant accordingly.
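The threshold idea can be sketched as follows. The `Alert` structure, the variable names, the threshold values, and the severity cut-offs are all hypothetical; they only show how a group of variables might decide whether an alert is noise and how severe it is.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str
    message: str
    variables: dict = field(default_factory=dict)


# Hypothetical per-variable thresholds (must be positive); values below are noise.
NOISE_THRESHOLDS = {"temperature_c": 70.0, "cpu_load": 0.85}


def is_noise(alert, thresholds=NOISE_THRESHOLDS):
    """An alert is noise only if every monitored variable is below its threshold."""
    monitored = [name for name in alert.variables if name in thresholds]
    if not monitored:
        return False  # nothing to judge by, so keep the alert
    return all(alert.variables[name] < thresholds[name] for name in monitored)


def severity(alert, thresholds=NOISE_THRESHOLDS):
    """Crude severity from the worst value-to-threshold ratio."""
    ratios = [
        value / thresholds[name]
        for name, value in alert.variables.items()
        if name in thresholds
    ]
    if not ratios:
        return "unknown"
    worst = max(ratios)
    if worst >= 1.2:
        return "high"
    if worst >= 1.0:
        return "medium"
    return "low"
```

With these assumed thresholds, an alert reporting `temperature_c` of 55.0 is treated as noise, while one reporting 90.0 is kept and rated `"high"`.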
Failure Prediction: It is possible to identify performance degradation and suboptimal operation from some of the variables in the alert management system. A properly designed alert management system not only channels alerts into specific categories or finds the root cause of a problem, but also sets a threshold that warns of an impending degradation or failure.
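One simple way to turn a variable's history into an early warning is to fit a straight line to recent readings and estimate how many sampling intervals remain before a failure limit is reached. The sketch below does this with an ordinary least-squares slope; the function name, the linear-trend assumption, and the example readings are illustrative only, and real degradation is often not linear.

```python
def steps_until_limit(readings, limit):
    """Estimate the number of sampling intervals until `limit` is reached.

    Fits a least-squares line to equally spaced readings and extrapolates
    from the latest reading. Returns None when there are fewer than two
    readings or the trend is flat or falling, and 0.0 when the latest
    reading is already at or above the limit.
    """
    n = len(readings)
    if n < 2:
        return None
    if readings[-1] >= limit:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    slope = sxy / sxx
    if slope <= 0:
        return None
    return (limit - readings[-1]) / slope
```

For readings of 60, 62, 64, and 66 against a limit of 70, the fitted slope is 2 per interval, so the estimate is about two intervals. An alert manager could raise a warning whenever that estimate drops below a chosen lead time.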
Automatic trigger of action: An alert management system can automatically trigger an action for immediate mitigation of the problem for which alerts are generated.
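Automatic triggering can be sketched as a registry that maps each alert topic to a mitigation handler. The decorator `action_for`, the topic names, and the handlers below are hypothetical, and the handlers only return a description of what they would do; in practice they would call whatever automation interface the monitored system exposes.

```python
# Registry of mitigation handlers, keyed by alert topic.
ACTIONS = {}


def action_for(topic):
    """Decorator that registers a handler for one topic."""
    def register(func):
        ACTIONS[topic] = func
        return func
    return register


@action_for("storage")
def expand_volume(alert):
    # Placeholder: a real handler would call the storage system here.
    return f"expand volume requested for {alert['source']}"


@action_for("network")
def restart_interface(alert):
    # Placeholder: a real handler would call the network controller here.
    return f"interface restart requested for {alert['source']}"


def trigger(alert):
    """Run the registered action for the alert's topic, else escalate."""
    handler = ACTIONS.get(alert["topic"])
    if handler is None:
        return f"escalated to operator: {alert['source']}"
    return handler(alert)
```

Calling `trigger({"topic": "storage", "source": "vol-7"})` runs the storage handler, while an alert with an unregistered topic is escalated to a human operator, so unknown situations are never acted on automatically.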
Summary: AI-based tools for triaging bugs and analyzing the tickets raised by test engineers have been in use for some time now. These tools significantly improve the efficiency of allocation and the productivity of bug fixing, and they help automate the whole workflow of triaging, allocating, and fixing bugs. Some of the core modules of this kind of AI tool can be used to develop an automated alert management system.
AI-based alert management systems can improve the productivity of IoT, IaaS application, SCADA, and similar systems. Additional features and use cases need to be implemented on top of a ticket analyzer to make an efficient and complete alert manager.