IT Operations – Towards automation and prediction

SMT has been helping customers with IT Operations issues for many years. Partly thanks to the new SMT Enriched Data Analytics Platform® (SMT EDAP®), it can accelerate IT processes and make them more efficient, Product Manager Erwin Vrolijk explains.

 

Everything related to IT is changing rapidly. Vrolijk notices that today’s customers ask different questions than before. “Environments are becoming more technically complex, and IT Operations must be able to manage such an environment properly. In the past it was often about simple questions. How full is the memory? What about the network load? Now, environments change much faster and responsibilities are more complicated, partly because of the different types of cloud models. Security and compliance have become considerations, and organizations want to look at such an environment from different job levels and departments.”

 

Five steps

SMT has developed a roadmap that helps customers optimize IT processes while maintaining maximum control. “We have identified and described different phases, each with its own characteristics and approach. Before moving towards AI Ops, you need to sort out a number of things first, because there is a dependency between the phases. In total there are five steps, which indicate the degree of maturity. The fifth step is the ultimate goal. At that point, with the help of artificial intelligence among other things, companies are fully predictive.”

The first step is called “reactive”. “Companies often work like a kind of fire brigade; they solve each problem separately, without any form of centralization or automation. That takes a lot of time.”

The second step is called “expective”, in other words wait and see. “Known, recurring problems and the responses to them have been mapped and are dealt with faster. You do not yet see problems coming, and new, unknown incidents are still solved with great difficulty. You can respond faster, but the incidents are still dealt with separately.”

After that, “operational visibility” is the third step. “The technical chain that is needed to deliver services has been mapped. As a result, problems are identified faster and a good estimate of the impact of a disruption can be made.”

Step four is called “IT insights”. “It is crucial that the information from the previous phase is now no longer only used by IT, but also gives direction to the business processes. Data becomes part of the processes and is actively used to support critical decisions. Do I choose new hardware or the cloud? Am I investing in my database platform or in my network?”

SMT calls the ultimate fifth step “AI Ops”. “At this point, decisions are automated and incidents are predicted so that potential problems are solved before they occur. This is the ultimate goal of every organization.”

Predicting

“We first analyze which phase companies are in, individually for each department and even for every application,” Vrolijk explains. “We look at the pain points. For example, are certain data sources missing, or are SLAs not being met because systems are slow or fail? This way we decide where to start. After all, it’s not just a matter of using artificial intelligence or automatically running playbooks in the event of an alert. Before you get there, you must first take care of many other things. If you want to be able to predict whether an application will fail, you must first map out how that application performs.”

The SMT Enriched Data Analytics Platform® is a combination of Splunk software and SMT’s own range of services, which makes it an important foundation. “You put data in it and then look at the information from all kinds of roles. It makes no difference whether you, as a specialist, want to look at log files in great detail, or, as a manager, want a more holistic view of capacity reports, uptime or SLAs. It’s all possible with the SMT EDAP®.”

Solving

Vrolijk uses an example to explain how SMT works. “A customer who works in a private cloud environment had an issue with a business-critical application that would occasionally crash at unpredictable moments. This can have many causes. We gained insight into the entire stack, from hardware to the application. That way we could quickly establish that the problem had to be somewhere in the application. Because we measured CPU usage and memory usage, among other things, we were able to see, partly thanks to all the data in our platform, that there was a memory leak in a specific module of the application. As a result, the application gradually used more and more memory until it crashed. We were able to solve that, and with the same software we validated that the problem was actually solved. What made it even better was that those specific measurements allowed us to predict a new, comparable problem before the customer was ever bothered by it. The same indicators lit up for another module of the same application, and so we prevented another crash and the associated downtime.”
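The kind of signal described here, memory usage that keeps creeping upward, can be illustrated with a small trend check. The sketch below is not SMT's or Splunk's method; it is a minimal Python illustration that fits a least-squares line through memory samples and flags a steady upward trend. The function names, the sample data and the thresholds (`min_slope_mb`, `min_r_squared`) are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Return (slope per sample, R^2) of a least-squares line through values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    syy = sum((y - mean_y) ** 2 for y in values)
    slope = sxy / sxx
    # A perfectly flat series has no variance to explain; report R^2 as 0.
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared


def looks_like_leak(
    memory_mb: Sequence[float],
    min_slope_mb: float = 1.0,    # hypothetical threshold: MB gained per sample
    min_r_squared: float = 0.9,   # hypothetical threshold: how steady the growth is
) -> bool:
    """Flag a series whose memory usage rises steadily rather than fluctuating."""
    slope, r_squared = linear_trend(memory_mb)
    return slope >= min_slope_mb and r_squared >= min_r_squared


# Invented sample data: one module grows by about 5 MB per sample, one is stable.
leaking_module = [500 + 5 * i + (i % 3) for i in range(60)]
stable_module = [500 + (i % 7) for i in range(60)]

print(looks_like_leak(leaking_module))
print(looks_like_leak(stable_module))
```

Running the same check per module is what would let the second, comparable problem show up before it causes a crash: the stable module fluctuates without a trend, while the leaking one shows a consistent positive slope.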

 

The platform also helps IT operators to structure alerts. “We have seen situations in which more than ten thousand alerts were sent to an operator every day. That is impossible to handle. Operators usually see no more than five percent of them, while the rest also contain important problems. Organizations that are a bit further along the roadmap can cluster alerts with the help of machine learning and reduce them to, for example, ten groups. This makes the work of the analyst or the operator much more efficient.”
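To give a feel for what reducing alerts to a handful of groups means, here is a deliberately simple Python sketch. It does not use machine learning and is not how the SMT EDAP® or Splunk clusters alerts; it swaps in a rule-based stand-in that masks the variable parts of each message (IP addresses and numbers) and groups alerts that share the resulting template. The alert texts and function names are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Variable parts of an alert message and the placeholder that replaces them.
# The IP pattern runs first so its digits are not masked one number at a time.
_VARIABLE_PARTS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\d+"), "<n>"),
]


def alert_template(message: str) -> str:
    """Reduce an alert message to a template by masking its variable parts."""
    template = message.lower()
    for pattern, placeholder in _VARIABLE_PARTS:
        template = pattern.sub(placeholder, template)
    return template


def group_alerts(messages: Iterable[str]) -> Dict[str, List[str]]:
    """Group alert messages that share the same template."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for message in messages:
        groups[alert_template(message)].append(message)
    return dict(groups)


# Invented alerts, standing in for the thousands an operator might receive.
alerts = [
    "Disk usage 91% on web-01",
    "Disk usage 97% on web-02",
    "Ping timeout from 10.0.0.12",
    "Ping timeout from 10.0.3.7",
    "Disk usage 85% on db-01",
]

for template, members in group_alerts(alerts).items():
    print(f"{len(members)} x {template}")
```

A machine learning approach would learn such groupings from the data instead of relying on hand-written masks, but the effect for the operator is the same: a few groups to review instead of every individual alert.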

 

Vrolijk has a clear vision of the future. “In five years’ time, what we regard as the fifth step, AI Ops, will largely be a reality. Part of IT Operations will be automated by algorithms, and the people and teams responsible for it will be able to deal with other, more difficult problems.”

 

July 2019