In today’s virtual tape environment, one of the key performance measures is whether the tape data is being replicated to the Disaster Recovery site in a timely manner. In other words, “Are my Tape RPOs being met?” This video demonstrates how you can leverage IntelliMagic Vision to identify replication performance issues and drill down to root causes for issues in your TS7700 grid environment. In this case, the replication from two host clusters to a third DR cluster is falling significantly behind, and the site is not meeting its SLAs for Recovery Point Objectives (RPOs).
In today’s virtual tape environment, one of the key performance measures is whether the tape data is being replicated to the Disaster Recovery site in a timely manner. In other words, “Are my Tape RPOs being met?” Today I’m going to demonstrate how you can leverage IntelliMagic Vision to identify replication performance issues and drill down to root causes for issues in your TS7700 grid environment.
The customer in this case study was a large z/OS IBM TS7700 tape user. While using IntelliMagic Vision, he realized that his tape data was not being replicated to his DR site within the time specified in his Service Level Agreements, or “SLAs.” His Recovery Point Objective, or “RPO,” was not being met because replication was taking longer than expected. We will see how IntelliMagic Vision was used to examine this issue and then determine the causes and possible action plans to eliminate or lessen the impact.
The first indication of a problem that the customer saw was on the TS7700 Replication Dashboard for this grid, Grid2C. It showed that the Average Deferred Queue Age for CL1 exceeded the thresholds for that metric. Average Deferred Queue Age is a measure of how old the tape volsers still waiting to be replicated to a receiving cluster are. In this grid, clusters CL2 and CL3 are host attached, and they replicate their tape data to cluster CL1, which is located at the DR site. The dashboard also shows that the replication backlog for CL1 had exceeded its thresholds. Additionally, the dashboard indicated that significant Deferred Copy Throttling was occurring on clusters CL2 and CL3.
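To make the Average Deferred Queue Age definition concrete, here is a minimal Python sketch. It assumes the metric is a plain arithmetic mean of how long each queued volume has been waiting; the function name, the timestamps, and the sample queue are hypothetical and are not taken from the TS7700 or from IntelliMagic Vision.

```python
from datetime import datetime, timedelta

def average_deferred_queue_age_minutes(queued_at, now):
    """Mean age, in minutes, of volumes still waiting to be copied.

    Assumption: a simple arithmetic mean over the volumes in the queue.
    """
    if not queued_at:
        return 0.0
    total_seconds = sum((now - t).total_seconds() for t in queued_at)
    return total_seconds / len(queued_at) / 60.0

# Hypothetical queue: three volumes waiting 30, 90 and 180 minutes.
now = datetime(2024, 1, 1, 12, 0)
queue = [now - timedelta(minutes=m) for m in (30, 90, 180)]
print(average_deferred_queue_age_minutes(queue, now))  # 100.0
```

In this made-up example, a single volume that has waited three hours pulls the average well above a 60-minute target even though one volume is only 30 minutes old.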
Using the standard drill-down capability within IntelliMagic Vision, we can focus on the Replication metrics time charts for the CL1 cluster in this grid. By clicking on the CL1 row, we can see a Replication Overview multi-chart for CL1 showing various replication metrics for that cluster. These charts are from the perspective of the receiving cluster. A multi-chart display can give you perspectives that are not available when looking at only one metric at a time. It lets you see how one metric may affect one or more of the other metrics. Here, we can see that at peak times during these four days, the replication backlog builds up to over 6000 GB and the Average Deferred Queue Age is over 150 minutes. Since the customer’s SLA called for a 60-minute RPO, he was not meeting this SLA. With careful examination of the times on these charts, he saw that the backlog did not start coming down until after the Inbound Total Copy Data Rate jumped up to approximately 400 MB/sec. If we want to look a little closer at the Average Deferred Queue Age, we can click on that chart.
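The figures quoted above lend themselves to a quick back-of-the-envelope check. The following Python sketch flags samples that exceed the 60-minute RPO and estimates how long a backlog takes to clear at a steady copy rate. The function names and the sample data are hypothetical, 1 GB is taken as 1000 MB, and the drain estimate ignores new copy work arriving unless a rate for it is supplied.

```python
RPO_MINUTES = 60  # the RPO stated in the customer's SLA

def rpo_violations(samples, rpo_minutes=RPO_MINUTES):
    """Return the (label, age_minutes) samples whose queue age exceeds the RPO."""
    return [(label, age) for label, age in samples if age > rpo_minutes]

def drain_time_minutes(backlog_gb, inbound_mb_per_sec, new_work_mb_per_sec=0.0):
    """Minutes needed to clear a backlog at a steady net copy rate.

    Assumptions: 1 GB = 1000 MB, and both rates stay constant.
    """
    net = inbound_mb_per_sec - new_work_mb_per_sec
    if net <= 0:
        return float("inf")  # the backlog never shrinks
    return backlog_gb * 1000.0 / net / 60.0

# Hypothetical queue-age samples (time of day, minutes).
samples = [("02:00", 45), ("04:00", 150), ("06:00", 95)]
print(rpo_violations(samples))        # [('04:00', 150), ('06:00', 95)]
print(drain_time_minutes(6000, 400))  # 250.0
```

Under these assumptions, a 6000 GB backlog takes roughly 250 minutes to clear even at 400 MB/sec, which is consistent with queue ages far above a 60-minute RPO.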
This better highlights the thresholds that are being used to rate this metric. So, is the issue that CL1 cannot receive the replicated data fast enough, or that clusters CL2 and CL3 are not sending the data fast enough? IntelliMagic Vision has some special multi-chart Perspectives which can help to answer this question. These perspectives, which cover functions within the TS7700 grid such as Replication, Migration, and Recall, were introduced in version 8 of IntelliMagic Vision for z/OS Tape. Again, they let you easily see how various measurements relate to each other.
Here’s the Replication Perspective from the sending clusters’ point of view for Grid2C. We note that deferred copy throttling is occurring quite often, and this can slow down deferred copy replication. CPU utilization over 85% can cause throttling, but we can see that it does not exceed 85% on any of the clusters. However, by default, throttling can also be invoked when Virtual Device Throughput exceeds 100 MB/sec, and we can see that this level is exceeded much of the time. Fortunately, there are tunable parameters that can be set in the TS7700 to control replication. The DCT Threshold (DCTAVGTD) defaults to 100 MB/sec. While this was probably a good default for the first generation of the TS7700, it is probably too low for the second generation, which has much more powerful POWER7 processors. In this case, they might want to set the DCT Threshold as high as 350 MB/sec to eliminate deferred copy throttling at all but a few isolated peaks. You can also minimize the impact of throttling by changing the DCOPYT parameter from its default of 125 milliseconds (which permits only a trickle of replication) to something like 10 milliseconds, or even 0 milliseconds to eliminate throttling altogether. Note that most of this is explained for this replication perspective in the upper left-hand corner of the screen.
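The two throttling triggers described above can be modeled in a few lines of Python to see how raising the DCT Threshold changes how often throttling would apply. This is a deliberately simplified sketch based only on the description in this walkthrough, not the TS7700’s actual algorithm; the function names and the interval data are hypothetical.

```python
def dct_would_apply(cpu_util_pct, host_throughput_mb_s,
                    dct_threshold_mb_s=100.0, cpu_limit_pct=85.0):
    """Simplified model: throttle if CPU or host throughput is over its limit."""
    return (cpu_util_pct > cpu_limit_pct
            or host_throughput_mb_s > dct_threshold_mb_s)

def throttled_fraction(intervals, dct_threshold_mb_s):
    """Fraction of (cpu_pct, mb_per_sec) intervals in which throttling applies."""
    if not intervals:
        return 0.0
    hits = sum(1 for cpu, mbs in intervals
               if dct_would_apply(cpu, mbs, dct_threshold_mb_s))
    return hits / len(intervals)

# Hypothetical interval data: (CPU utilization %, virtual device MB/sec).
intervals = [(60, 80), (70, 180), (75, 320), (65, 240), (72, 410)]
print(throttled_fraction(intervals, 100.0))  # 0.8
print(throttled_fraction(intervals, 350.0))  # 0.2
```

With this made-up data, raising the threshold from 100 to 350 MB/sec leaves only the single 410 MB/sec peak throttled, which mirrors the “all but a few isolated peaks” goal described above.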
Here, we can definitely see that CL2 and CL3 are applying deferred copy throttling with, at times, as much as 125 milliseconds of delay, which effectively slows replication to a trickle. By modifying the DCOPYT and/or DCTAVGTD tunable TS7700 parameters, they should be able to reduce or eliminate the unnecessary throttling that is occurring and reduce their Average Deferred Queue Ages. Of course, after these changes are made, we would want to examine these charts again to determine whether the desired impact was achieved.
So, in summary, we used the dashboards in IntelliMagic Vision for z/OS Tape to identify that there was a replication issue. Then, using the various drill-downs and perspectives available in IntelliMagic Vision, we were able to investigate the problem and see some advice on possible solutions. And of course, after any changes are made, you would want to examine the dashboards and charts again to verify that the desired results were achieved. If not, then further investigation may be required.
In conclusion, IntelliMagic Vision was used to identify replication issues within a TS7700 grid. IntelliMagic Vision uses expert knowledge about systems, configuration, and workload to proactively identify potential risks in your environment.
If you would like to find out more about IntelliMagic Vision for your environment, please email us at firstname.lastname@example.org