Enterprise computing systems and storage operations teams have a difficult job: manage the IT infrastructure so that application availability is always efficiently maintained. But this is virtually impossible due to the complexity and disparity of the metadata and reporting tools for the various infrastructure components. A lack of information is not the problem; rather, the great need is to derive meaningful intelligence from all that information.
This is one of the reasons why the Cloud and Outsourcing can be so attractive to executives: the responsibility, complexity, and cost of delivering infrastructure performance shifts to someone else. But the cloud, for example, will not work for all applications due to performance and security requirements. And outsourcing doesn’t make infrastructure performance problems go away; in fact, it can make them harder to resolve. So most enterprise organizations will still require, and benefit from, deep infrastructure performance analysis capabilities.
In recent years, a new class of products, initially called IT Operations Analytics (ITOA), has come on the market with the design objective of providing a single interface into all the data generated by disparate devices and, more importantly, helping interpret what that data really means for performance, availability, and efficiency.
IT Operations Analytics
The idea is to employ the computer to do more of the work of deriving meaningful intelligence from all the data. If designed correctly, this is a type of artificial intelligence: the machine does the analysis, which enables human IT operations teams to be more effective. In 2017, Gartner coined the term AIOps, which is a fitting name for this capability.
A Forrester Research definition for IT Operations Analytics is:
“The use of mathematical algorithms and other innovations to extract meaningful information from the sea of raw data collected by management and monitoring technologies.”[Wikipedia]
So it is not just about an easier way to create new or more reports. IT operations teams already have more reports than they have time to look at. It is about being predictive and about solving problems more quickly, whether they are in the predictive stage or the real-time production problem stage.
By using mathematical algorithms, for example, unusual workloads across different devices can be automatically identified. This is of some value in identifying production problems, but it is not enough to predict and prevent problems or to determine root cause, because the correlated metrics are usually only symptoms of the problems.
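As a rough illustration of the kind of purely statistical check described above, the following sketch flags devices whose workload is unusual relative to their peers, using a median-based (modified z-score) test. The device names, the sample values, and the 3.5 cutoff are assumptions made for this example and are not taken from any particular product. Note that the check knows nothing about what the devices can actually sustain, which is why it can only surface symptoms.

```python
from statistics import median

def unusual_devices(metric_by_device, cutoff=3.5):
    """Return device names whose value is an outlier relative to the group.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The cutoff of 3.5 is a common rule of thumb
    and is an assumption here, not a recommendation.
    """
    values = list(metric_by_device.values())
    if len(values) < 3:
        return []  # too few peers to say what "usual" looks like
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # all peers (nearly) identical; this score is undefined
    return sorted(
        name
        for name, v in metric_by_device.items()
        if 0.6745 * abs(v - center) / mad > cutoff
    )

# Hypothetical throughput (MB/s) for six peer storage controllers.
sample = {
    "ctrl-a": 410.0, "ctrl-b": 395.0, "ctrl-c": 402.0,
    "ctrl-d": 398.0, "ctrl-e": 405.0, "ctrl-f": 1250.0,
}
flagged = unusual_devices(sample)
print(flagged)
```

Here only the controller carrying roughly three times the load of its peers is flagged. Whether that load is actually a problem depends on the controller's capabilities, which this relative comparison cannot see.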
From IntelliMagic’s perspective, in the definition above “other innovations” must go beyond statistical algorithms; this is absolutely required in a solution that truly predicts problems, solves issues more quickly, and improves efficiency of teams and infrastructure.
The reason is that for algorithms to perform root cause analysis, the actual capabilities of the infrastructure and the relevant best practices must be accessible in a machine-readable form. Statistical analysis of metrics only relative to one another is insufficient for root cause analysis or for monitoring those root causes.
Artificial Intelligence for IT Operations Analytics
This type of Applied Artificial Intelligence (AI), which can utilize machine-readable expert knowledge about the specific infrastructure components in use, is the key innovation that puts magic into the process. AI means different things to different people, and there are many valid forms of AI, as can be seen in The Periodic Table of AI, for example.
Here is a practical definition of AI from Marvin Minsky back in 1968:
“Artificial Intelligence is the science of making machines do things that would require intelligence if done by man.”
Is there anything you as an IT expert could do, that would be valuable if you did do it, that either you don’t have time to do, or you would be bored to death doing every day? These are exactly the tasks where AI will make a big difference in your IT operations.
One example is evaluating the millions of individual metrics to see whether they are approaching the limits of any physical infrastructure component. Another is evaluating how well your workloads, on your infrastructure configuration, comply with recommended best practices, such as those published in IBM Redbooks.
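The first of these tasks can be sketched in a few lines. The example below rates measured values against machine-readable limits for the components they belong to. The metric names, limit values, and warning fractions are all hypothetical, chosen only to show the shape of such a check; real limits would come from expert knowledge of the specific hardware.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComponentLimit:
    """Machine-readable statement of what a component can sustain."""
    max_value: float            # hypothetical capability of the component
    warn_fraction: float = 0.8  # fraction of max_value that triggers a warning

def rate(value, limit):
    """Rate one measurement as 'ok', 'warning', or 'exceeded'."""
    utilization = value / limit.max_value
    if utilization >= 1.0:
        return "exceeded"
    if utilization >= limit.warn_fraction:
        return "warning"
    return "ok"

def assess(samples, limits):
    """Rate every sample for which a limit is known; skip the rest."""
    return {
        metric: rate(value, limits[metric])
        for metric, value in samples.items()
        if metric in limits
    }

# Hypothetical limits and one interval of measurements.
limits = {
    "port_throughput_mbps": ComponentLimit(max_value=1600.0),
    "cache_write_pending_pct": ComponentLimit(max_value=100.0, warn_fraction=0.7),
}
samples = {
    "port_throughput_mbps": 1400.0,
    "cache_write_pending_pct": 35.0,
    "metric_without_known_limit": 5.0,
}
ratings = assess(samples, limits)
print(ratings)
```

The point of the sketch is the separation of concerns: the measurements are generic, while the limits encode knowledge about a specific component. A machine can apply such a table to millions of values per day without fatigue.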
Machines are for Answers, Humans are for Questions.
Kevin Kelly has said: “Machines are for answers, humans are for questions.”
In many domains, answers are becoming a commodity, and knowing which questions to ask is what matters more. For example, the Google search engine and the assistant on your smartphone can answer questions by drawing on vast amounts of information. Wouldn’t it also be great if the IT operations team could simply ask the computer if there are any bottlenecks starting to develop that are not yet showing up as production problems? Or ask “how well are all of the infrastructure components delivering the required service levels for our most important business application?”
That is exactly the point of IntelliMagic Vision – it can assess all of the metrics against machine-readable knowledge about the specific infrastructure components and best practices, and it surfaces the answers to those kinds of questions.
For virtually all large IT shops, these answers are often incredibly time-consuming to find without AI-enabled software that applies infrastructure-specific, automated expert analysis to the disparate data. Without this capability, IT operations teams spend far too much of their time in stressful fire-fighting mode, reacting to problems that have already occurred rather than using predictive analysis to identify developing issues and address them before they become emergencies.
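To make "predictive" concrete, here is a deliberately simple sketch that fits a straight line to recent utilization samples and estimates how many more sampling intervals remain before a known limit is reached. The function name, the sample history, and the limit of 90 are assumptions for illustration. A linear trend is only the simplest possible model, and it is not presented here as how any particular product forecasts.

```python
def intervals_until_limit(history, limit):
    """Estimate sampling intervals left before a linear trend reaches limit.

    history: equally spaced measurements, oldest first.
    Returns None when there is no upward trend (or too little data),
    and 0.0 when the fitted line is already at or above the limit.
    """
    n = len(history)
    if n < 2:
        return None
    # Ordinary least-squares fit of y = intercept + slope * x, x = 0..n-1.
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or falling: no crossing to predict
    intercept = y_mean - slope * x_mean
    crossing_x = (limit - intercept) / slope
    return max(0.0, crossing_x - (n - 1))

# Hypothetical daily peak utilization (%) of one component, limit at 90%.
remaining = intervals_until_limit([50, 55, 60, 65, 70], limit=90)
print(remaining)
```

The forecast is only as good as the limit it is measured against, which is again where knowledge of the specific infrastructure component comes in.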
Platform-specific intelligence in a solution is also required to enable the human staff to seamlessly navigate and drill down through the disparate but interrelated metrics for a problem, and to see metric definitions and explanations of rated assessments that are out of range. This can dramatically improve the effectiveness of their analysis even for expert users, and it can accelerate learning for newer staff.
IntelliMagic software includes a built-in fundamental understanding of how workloads, logical concepts and physical hardware interact, and what represents good or bad values based on the infrastructure context. This understanding, or intelligence, is applied to the data automatically or “artificially” by the computer, rather than requiring a human expert to manually evaluate the data.
This narrow type of AI, used in hardware or software robots designed to do specific tasks, is far more common and practical than general AI or thinking machines. It also yields far faster and more practical problem prevention and resolution than machine learning approaches, which typically employ only a platform-agnostic view of the metrics.
Understanding, monitoring, and predicting the root causes of application infrastructure performance problems requires digitized, platform-specific expert knowledge that is not yet feasible for machine learning approaches.
By embedding infrastructure-specific expert knowledge into the software, the infrastructure can be continuously evaluated for most of the issues that are likely to cause service disruptions for the application workloads it is running. And when issues are identified, the software can anticipate what the human analyst will want to see, so that, for example, intelligent navigation through the interrelated data model can be prepared and flagged in advance.
This level of evaluation and automation elevates the human analyst into a more strategic position: asking questions and getting answers from the computer far more easily than when the analyst has only static reports or raw data to work with.