For large z/OS sites, daily performance monitoring is critical. Every minute of IT unavailability can be extremely costly to the business, so these sites employ the best staff and solutions available to keep their infrastructure always available.
Through daily performance reviews, one such site was proactively alerted to a disk replication issue: ongoing exceptions for high asynchronous replication response times on one of its primary disk subsystems.
Thanks to this proactive notification, and the quick work of the IT staff, they were able to remedy a configuration shortcoming and avoid any impact to the business.
Asynchronous Replication Problem Analysis
After the site implemented replication on storage system P1111, their IT availability solution, IntelliMagic Vision, began reporting consistent warnings and exceptions for asynchronous replication send response times.
This exception is illustrated in Figure 1 below.
IntelliMagic Vision displays warnings and exceptions in consolidated dashboards that show the key replication metrics for all storage systems rolled up over an analysis period of your choosing. Metrics are assessed based on detailed knowledge of the configuration and performance capabilities of the storage systems. When a warning or exception is automatically detected, the metric changes colors to yellow or red, respectively.
Drilling down on the exception, by simply clicking on the big red circle, consistently showed that send replication response times were high, even though the amount of data being moved was not exceeding the expected link capabilities:
The five mini-charts in Figure 2 let you see at a glance the key asynchronous replication data metrics for this storage system and alert you to problems.
For each port over time for the full day you see Total Asynchronous Replication Throughput, how much is sent vs. received, and the corresponding response times. You can see by the color of the border which metrics are neutral (no color or rating), within the capabilities of the configuration (green), or exceeding the capabilities of the configuration (yellow or red).
You can zoom in on the problem area by clicking on the mini-chart with the red border.
Figure 3 displays the asynchronous replication send response times for all Link IDs in the storage system over one day:
We see that all eight Link IDs have very similar response times throughout the day. Since the link response times are balanced, this rules out a problem with an individual Link ID. We also see that response times are typically in the 120 ms range, but spike to 250 ms or more.
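The balance check described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how IntelliMagic Vision evaluates links: the function name, the 15% tolerance, and the sample response times are all assumptions made for the example.

```python
from statistics import mean


def links_are_balanced(avg_ms_by_link, tolerance=0.15):
    """Return True when every link's average response time is within
    `tolerance` (as a fraction) of the mean across all links.

    The 15% default is a hypothetical value chosen for illustration.
    """
    overall = mean(avg_ms_by_link.values())
    return all(
        abs(value - overall) <= tolerance * overall
        for value in avg_ms_by_link.values()
    )


# Hypothetical per-link averages (ms) for eight Link IDs, all near 120 ms.
sample_links = {
    f"link{i}": ms
    for i, ms in enumerate([118, 121, 119, 122, 120, 123, 117, 120])
}

# Hypothetical counter-example: one link far slower than the others.
skewed_links = {"link0": 120, "link1": 120, "link2": 300}

balanced = links_are_balanced(sample_links)
skewed = links_are_balanced(skewed_links)
```

When all links pass a check like this, a single faulty link is an unlikely cause, and attention shifts to something the links share, such as the receiving storage system.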
We next review the volume of data replicated by selecting the chart for Asynchronous Replication Send.
In Figure 4 we see that the volume of data being sent is not very high. The green border on the chart confirms that the volume of data is well within the capabilities for this model of storage system. The darker background area shows the warning threshold, which is set to 55 MB per second and marks the level at which data volume would start to be a concern. These data volumes are well below that threshold and are not the reason for the high response times.
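As a rough sketch of how a threshold-based rating like this could work, the Python below maps a throughput value to a color. Only the 55 MB per second warning level comes from the text above. The exception level of 80 MB per second, the function name, and the sample values are hypothetical, and the actual rating logic in IntelliMagic Vision is based on detailed knowledge of the storage system and is not shown here.

```python
WARNING_MB_PER_S = 55.0    # warning threshold quoted in the article
EXCEPTION_MB_PER_S = 80.0  # hypothetical exception threshold, for illustration only


def rate_throughput(mb_per_s,
                    warning=WARNING_MB_PER_S,
                    exception=EXCEPTION_MB_PER_S):
    """Map a throughput value to a rating color.

    green  = within the capabilities of the configuration
    yellow = at or above the warning threshold
    red    = at or above the exception threshold
    """
    if mb_per_s >= exception:
        return "red"
    if mb_per_s >= warning:
        return "yellow"
    return "green"


# Hypothetical send throughput samples in MB per second.
ratings = [rate_throughput(value) for value in (12.0, 30.5, 60.0, 95.0)]
```

A volume that rates green alongside a response time that rates red is the pattern seen in this case: the amount of data is not the problem, so the cause lies elsewhere.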
Misconfigured Secondary Storage System Leads to High Response Times
The z/OS site was migrating data to that storage system overnight on a scheduled basis and was initially unconcerned about the high response times.
In addition to using IntelliMagic Vision as their IT availability solution, the customer was also taking advantage of IntelliMagic Services, in which IntelliMagic performance consultants review the environment on a regular basis to help proactively identify issues of concern. IntelliMagic Services continued to provide daily reporting on the customer’s environment and highlighted this issue as one that deserved further analysis.
A key point was that this customer had other storage systems being replicated, all of which had much lower response times than P1111 as demonstrated in Figure 5:
This prompted the customer to perform additional analysis on the secondary storage system receiving the data, and it soon became clear that the secondary storage system was the bottleneck. The customer took this information to the storage system vendor and discussed the overall analysis and potential remedies.
The vendor conceded that the secondary storage system was not properly configured to handle the replication volume required and provided an upgrade at no charge, saving the customer potential availability issues as well as hard dollar costs relating to the hardware upgrade.
Comparing the Data
With the bottleneck addressed, a comparison of data from the week after the upgrade to the week before the upgrade showed that response times improved dramatically, even though the volume of data sent was significantly higher.
IntelliMagic Vision allows you to compare two different time periods in side-by-side charts to more easily see changes. Figure 6 below shows throughput of data sent for a day before the storage system upgrade (left) compared to a day after the upgrade (right).
We see at a glance that the volume of data sent is higher after the upgrade, especially at peaks. We also have dotted lines showing us the average throughput value for each day, and the comparison calculated at the top of the chart, an increase of 85.40%.
Even with significantly higher volumes of data sent, the response times after the upgrade are much better. The comparison chart in Figure 7 shows a 63% improvement.
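The comparison percentages above are ordinary relative changes between two daily averages. The Python below is a minimal sketch of that arithmetic. The before and after averages are hypothetical values chosen only so that they reproduce the reported 85.40% and 63% figures; the article does not state the underlying averages.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent.

    Positive means an increase; negative means a decrease.
    """
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Hypothetical daily averages, chosen to match the reported percentages.
throughput_change = percent_change(before=20.0, after=37.08)     # MB per second
response_time_change = percent_change(before=120.0, after=44.4)  # milliseconds
```

With these inputs, `throughput_change` comes out to roughly +85.4 and `response_time_change` to roughly -63. For response times, a negative change is the improvement.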
As an additional benefit, the vendor agreed that P1111 did not have sufficient cache and agreed to upgrade it as well.
Avoiding High Asynchronous Replication Response Times
This use case highlights the value of ongoing proactive performance analysis of your critical IT infrastructure. Potential issues can be automatically recognized by IntelliMagic Vision so they can be addressed before they become serious enough to impact the business. It also highlights the need to be able to analyze issues over time, with comparison and drill down to narrow the problem space with the right metrics for the situation.
IntelliMagic Services notified the customer of ongoing exceptions for high asynchronous replication response times on one of their primary disk subsystems. The issue turned out to be the inability of the secondary disk subsystem to receive the data quickly enough.
Supporting metrics from IntelliMagic Vision, including a comparison of asynchronous replication response times for P1111 vs. other disk subsystems in their environment, gave them the information they needed to evaluate further and address a configuration shortcoming with the vendor. Configuration changes were made by the vendor, at no charge, which improved response times before there was any impact to the business.