What are the top 5 million things you need to do today to avoid application infrastructure performance problems? Because that’s usually what it comes down to.
In a perfect world and a perfect environment, all IT infrastructure operation issues would be proactively identified and prevented well before any application end user ever felt any production impact. The reality is often much crueler. Too often, performance teams are overworked and understaffed, running from one production fire to the next, always trying to solve problems in real time, always in crisis mode.
6 Steps to Proactively Prevent Any Potential Issues
But perhaps there comes a day with no production fire (that you’ve been alerted to), and you want to devote your time to proactively preventing any possible upcoming issues. Great. Now all you have to do (every day) is:
- Collect the millions (or billions) of performance, capacity, and configuration data points from the various infrastructure components involved
- Segment out from the shared infrastructure resources the portion that represents the specific application in question
- Perform some calculations on related metrics to produce valuable new metrics and ensure that you have visibility into root causes of problems, not just symptoms of problems
- Evaluate the metrics in the context of the infrastructure they come from: check whether any are outside the range that is normal and healthy for the type of workload, for the capabilities and capacity of the infrastructure (based on expert knowledge of what it can do), and for its best practices, and check for any misconfigurations or errors
- For out-of-range items, compare the metrics in question against other recent periods to see when the changes started
- Trace the path from the out-of-range root cause metrics to the symptoms (response times, etc.) that will be negatively affected if the underlying conditions are not proactively addressed
That’s it. That is all you have to do every day in order to be proactive in identifying issues and avoiding them before they become real problems. That, and avoid being bored to death when 99.9% of the metrics are not out of range. Then you only have to determine which of the remaining 1/10 of 1% are the important ones to look at. Easy, right?
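To make the scale of that list concrete, here is a minimal Python sketch of steps 2 through 5 for a single derived metric. Everything in it is an assumption made for illustration: the sample records, the application names, the `segment`, `response_ms`, and `first_out_of_range` helpers, and the simple three-standard-deviation rule. A real evaluation would also need the expert knowledge of workloads and infrastructure described above, and step 6 (tracing root causes to symptoms) is not covered at all.

```python
from statistics import mean, stdev

def segment(samples, app):
    """Step 2: keep only the samples that belong to one application."""
    return [s for s in samples if s["app"] == app]

def response_ms(sample):
    """Step 3: derive average response time per I/O from two raw counters."""
    return sample["wait_ms"] / sample["ios"]

def first_out_of_range(baseline, recent, k=3.0):
    """Steps 4-5: return the index of the first recent value that is more
    than k standard deviations from the baseline mean, or None if every
    value is in range. The k=3.0 default is an illustrative assumption."""
    mu, sigma = mean(baseline), stdev(baseline)
    for i, value in enumerate(recent):
        if abs(value - mu) > k * sigma:
            return i
    return None

def make(app, waits, ios=1000):
    """Step 1 stand-in: build hypothetical interval samples by hand."""
    return [{"app": app, "ios": ios, "wait_ms": w} for w in waits]

# Two applications share the same (hypothetical) infrastructure.
last_week = make("billing", [2000, 2100, 1900, 2000, 2200, 1800]) + make("hr", [900, 950])
today = make("billing", [2100, 2000, 2600, 3000]) + make("hr", [920])

baseline = [response_ms(s) for s in segment(last_week, "billing")]
recent = [response_ms(s) for s in segment(today, "billing")]

# Index of the first interval today where billing's response time left
# its normal range, or None if it never did.
print(first_out_of_range(baseline, recent))
```

This handles one metric for one application over a handful of intervals. Repeating it across millions of data points, with thresholds that reflect what each component can actually do, is the part that does not scale by hand.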
Don’t Steal the Computer’s Job
The question isn’t really whether a human can do the steps listed above. It’s why they should even try. Computers were developed to do the work that humans should never have to do: process millions upon millions of data points and automatically determine what’s important, how it relates to the data around it, and how severe it is.
This is what they are good at, and instead of fighting to keep these jobs for ourselves (while at the same time never having the time or aptitude to get to it), we should devote our time to tasks that humans are built for.
If you wanted to dig a ditch, you wouldn’t use a spoon just because it would keep you busy. Today you wouldn’t even use a shovel! Technology was developed as a tool to enhance our capabilities and make better use of our time – Artificial Intelligence is the next evolution of tools designed to help us.
Taking Advantage of AI for your IT Operations
When utilized correctly, and combined with human expert knowledge, Artificial Intelligence will dramatically simplify and streamline complex processes and tasks for performance and capacity teams. Teams who are encouraged to be proactive will be able to take advantage of the numerous benefits AI offers:
- Automate data processing (and all 6 tasks from the start of this blog)
- Identify potential problems predictively
- Offer recommendations to resolve existing issues
- Equip current staff to be more productive, and streamline training for new staff
- Optimize environments for performance and cost savings, and of course,
- Avoid application infrastructure performance problems
Almost all new technology meets resistance at its outset until it crosses the threshold from “that is never going to happen” to “oh, everyone is already using that.” The alternative may be that you can get it done (eventually) without the assistance of the technology, but what would you be able to do with that extra time? In the meantime, I think there are about a million things that need to be done to prevent the next issue.