CONNECT Time Puzzlement
One of the attractive features of using IntelliMagic Vision as a cloud service is that when something changes, you get technical experts excited about diagnosing what went wrong. Our curiosity kicks in and we have a strong desire to “solve the puzzle”.
Recently, a situation occurred at one of our cloud services customers that was a real head-scratcher. One of the disk storage systems saw a sudden jump in CONNECT time. Something had changed, but what was it?
As part of our cloud services solution, IntelliMagic professionals keep an eye on things and report any alerts back to our customers. Friday morning, the 28th of September, was just like any other Friday. We grabbed a cup of coffee and sat down to review our customers’ data. And WOW!
We saw a sudden, alarming change in the CONNECT time rating for an IBM DS8886. We looked back at the historical data and found nothing like it. This was something new and exciting (if you’re a performance guy).
The chart below shows a comparison of the CONNECT time on the day the problem started with the prior week. The change was dramatic and had a specific time of onset. In fact, it had increased 114% from the prior week!
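For readers who want to reproduce this kind of week-over-week comparison on their own measurements, here is a minimal sketch of the arithmetic. The two CONNECT time values are hypothetical, chosen only so that the result matches the 114% figure above; the actual measurements are not given in this post.

```python
def percent_change(prior: float, current: float) -> float:
    """Return the change from prior to current as a percentage of prior."""
    if prior == 0:
        raise ValueError("prior value must be non-zero")
    return (current - prior) / prior * 100.0


# Hypothetical average CONNECT times in milliseconds (not the customer's data).
prior_week_ms = 0.50
problem_day_ms = 1.07

change = round(percent_change(prior_week_ms, problem_day_ms), 1)
print(f"CONNECT time changed by {change}% versus the prior week")
```

Note that an increase of 114% means the value more than doubled: the new value is 2.14 times the old one.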
One thing you might notice in this figure is the different shadings in the chart area. These represent dynamic thresholds that IntelliMagic Vision calculates for each interval based on workload and hardware attributes. This is far superior to a static threshold because a high CONNECT time alone does not necessarily indicate a problem.
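To see why a static threshold falls short, consider this toy illustration. It is not IntelliMagic Vision's actual calculation, which is not described here. It simply assumes that the expected CONNECT time grows with the amount of data transferred per I/O, and that the link speed (`link_mb_per_s`) and the tolerance factor are made-up values.

```python
def connect_threshold_ms(kb_per_io: float,
                         link_mb_per_s: float = 800.0,
                         tolerance: float = 2.0) -> float:
    """Toy dynamic threshold: expected transfer time times a tolerance factor.

    link_mb_per_s and tolerance are assumed values for illustration only.
    """
    expected_ms = (kb_per_io / 1024.0) / link_mb_per_s * 1000.0
    return expected_ms * tolerance


def is_alarming(connect_ms: float, kb_per_io: float) -> bool:
    """Flag an interval only if CONNECT time exceeds what its workload explains."""
    return connect_ms > connect_threshold_ms(kb_per_io)


# The same 0.5 ms CONNECT time under two different (hypothetical) workloads.
large_block_flagged = is_alarming(connect_ms=0.5, kb_per_io=256.0)
small_block_flagged = is_alarming(connect_ms=0.5, kb_per_io=4.0)
print(large_block_flagged, small_block_flagged)
```

In this sketch, 0.5 ms is unremarkable for 256 KB transfers but suspicious for 4 KB transfers, whereas a single static limit would have to treat both intervals the same way.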
Finding the Source
By simply drilling down to CONNECT time by system, it was easy to determine that SYSID SYS21 was causing all the elevated CONNECT time. The chart below illustrates this.
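Conceptually, this drill-down is a group-and-rank operation over the interval data. The sketch below shows the idea on a handful of invented records; only SYS21 comes from this case, while the other system names and all of the values are hypothetical.

```python
from collections import defaultdict


def connect_time_by_system(records):
    """Sum CONNECT time per system and rank systems from highest to lowest."""
    totals = defaultdict(float)
    for sysid, connect_ms in records:
        totals[sysid] += connect_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# (system, CONNECT time in ms) samples; values are invented for illustration.
samples = [
    ("SYS21", 9.5),
    ("SYS22", 0.4),
    ("SYS21", 8.7),
    ("SYS23", 0.5),
]

ranking = connect_time_by_system(samples)
top_system, top_total = ranking[0]
print(f"{top_system} accounts for {top_total:.1f} ms of CONNECT time")
```

The same grouping applied to volume or data set names instead of system names gives the next drill-down level.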
Unusual VTOC Activity
Next, we wanted to know if the problem was isolated to specific datasets, so we simply clicked one more time to drill down to datasets. Indeed, we were able to see that the issue was coming from the VTOC of just a few specific volumes, as seen in the chart below.
What was this unusual VTOC activity? A conversation with the customer’s storage management team indicated that they had turned on AUTOBACKUP and AUTOMIGRATE for a few of their storage groups.
AUTOBACKUP and AUTOMIGRATE are the Culprits!
AUTOBACKUP and AUTOMIGRATE are automated functions of DFSMS used for storage management. Unfortunately, the data set reports showed that the VTOCs were being accessed inefficiently, which resulted in the high CONNECT times.
Another negative impact was that the channel the LPAR was using was getting hammered. The chart below shows channel utilization for the channel attached to SYSID C1H0.
You can see the huge disparity in utilization caused by the AUTOBACKUP/AUTOMIGRATE processes. We suggested that the customer open a case with IBM since this behavior is unacceptable.
As a temporary workaround, we recommended lowering the MAXBACKUPTASKS and MAXMIGRATIONTASKS settings from 6 to 1. This limits concurrency when AUTOBACKUP and AUTOMIGRATE are turned on for entire storage groups for the first time.
Fortunately, there were no adverse effects on applications this time. Had other activity been stressing the channels at the same moment, the outcome could have been much worse.
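As a rough sketch of what that change looks like, the helper below renders the two settings as DFSMShsm SETSYS statements. It assumes the limits are applied through SETSYS, either in the DFSMShsm startup member or as an operator command; the function name is hypothetical, and the permitted value ranges should be checked against the DFSMShsm documentation for your release.

```python
def setsys_task_limits(max_backup_tasks: int, max_migration_tasks: int) -> list[str]:
    """Render SETSYS statements that cap concurrent backup and migration tasks.

    Hypothetical helper: it only formats text and does not talk to DFSMShsm.
    """
    for value in (max_backup_tasks, max_migration_tasks):
        if value < 1:
            raise ValueError("task limits must be at least 1")
    return [
        f"SETSYS MAXBACKUPTASKS({max_backup_tasks})",
        f"SETSYS MAXMIGRATIONTASKS({max_migration_tasks})",
    ]


# The temporary workaround from this case: reduce both limits from 6 to 1.
workaround = setsys_task_limits(max_backup_tasks=1, max_migration_tasks=1)
for statement in workaround:
    print(statement)
```

Once the initial backup and migration pass over the newly enabled storage groups has completed, the limits can be raised again.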
An Ounce of Prevention
This problem had a happy ending. In this case, there were no serious implications from the elevated CONNECT time and channel utilization. Because we helped our customer be proactive and understand what was happening, they will be prepared if the same behavior threatens to cause real problems down the road.
Would you like to be more proactive in managing your z/OS Storage? IntelliMagic Vision could be the answer. Contact us for more information.
This article's author
Senior Storage Performance Consultant
Read Lee's bio