Time is Not Always on Your Side
In the 1960s the Rolling Stones had a hit song called “Time is On My Side”.
Don’t you wish it really was?
Too often, in our jobs and our personal lives, time seems to be the enemy. Too much to do and too little time is the mantra of the modern world.
Time is a Key IT Metric
If you work in IT, time is a key metric that you often must measure and track. How long did that job run? What is my I/O response time? How much time before I run out of capacity? How far behind is my asynchronous replication?
That last question can be critical, especially if you work for a financial company. There are frequently regulatory requirements on how far behind your replication can become without penalty. And in the event of a site switch due to a disaster or outage, do you want to explain why several high dollar value transactions are now lost?
Asynchronous Data Replication
Asynchronous data replication means that your DR site is always a bit behind your primary storage. Your target for how far behind you can be is called your Recovery Point Objective, or RPO.
With current asynchronous replication technologies, RPO may be anywhere from a few seconds up to about a minute depending on various factors. Some replication methodologies such as IBM XRC are designed to minimize RPO, but the tradeoff is that applications may be slowed down if throughput exceeds the replication capability.
Most peer-to-peer replication methodologies such as IBM Global Mirror (GM) or EMC Symmetrix Remote Data Facility Asynchronous (SRDF/A) prioritize application performance over replication currency. But if replication falls too far behind or stops altogether, you need to find out about it quickly and know what should be done to fix it.
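As a rough illustration of what "finding out quickly" can mean, the sketch below compares a measured replication lag against an RPO target. The function names, the 80% warning fraction, and the 30-second target are assumptions made up for this example; they are not part of any IBM, EMC, or IntelliMagic interface.

```python
from datetime import datetime, timedelta

def replication_lag(primary_write_time, last_replicated_time):
    """Return how far the DR copy trails the primary (never negative)."""
    return max(primary_write_time - last_replicated_time, timedelta(0))

def rpo_status(lag, rpo_target, warn_fraction=0.8):
    """Classify a lag against an RPO target.

    warn_fraction is a hypothetical early-warning margin: a lag at or
    above that fraction of the target is reported as 'warning'.
    """
    if lag > rpo_target:
        return "violation"
    if lag >= rpo_target * warn_fraction:
        return "warning"
    return "ok"

# Example with invented timestamps and a hypothetical 30-second RPO target.
rpo = timedelta(seconds=30)
now = datetime(2024, 1, 1, 12, 0, 45)
last_copy = datetime(2024, 1, 1, 12, 0, 0)
print(rpo_status(replication_lag(now, last_copy), rpo))
```

A warning tier below the hard target is one way to leave time to react before the RPO is actually missed.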
How Can I Manage Replication?
Recently, we helped an EMC SRDF/A user whose replication session kept suspending. They were not using IntelliMagic Vision at the time of the incident, but they were curious how it would have helped them see the problem and respond to it.
They sent us the appropriate SMF data from the time frame when the problem occurred and asked for our analysis. We loaded one month of data into IntelliMagic Vision and here were our findings.
Looking at the SRDF/A dashboard in Figure 1, the first thing we noticed was that indeed IntelliMagic Vision predicted that there was an issue with the SRDF/A Transmit Cycle Time a full two days before the outage. If prompt action had been taken, perhaps the outage could have been avoided altogether.
What is the Trend?
Trends can tell you a lot about where things are going. Figure 2 shows the rating over time for the SRDF/A cycle size. Cycle size is an indicator of the amount of data that must be transferred to the remote site during a single SRDF/A cycle.
You can see periodic spikes in the cycle size and those spikes were growing over time. Keeping an eye on this trend with IntelliMagic Vision would have provided an even earlier warning that something had changed.
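One simple way to check whether spikes are growing is to take the peak cycle size for each day and fit a straight line through those peaks. The sketch below does that with a plain least-squares slope. It is only an illustration of the idea, not how IntelliMagic Vision computes its ratings, and the sample layout of (day index, cycle size) pairs is an assumption.

```python
def daily_peaks(samples):
    """samples: iterable of (day_index, cycle_size). Return {day: peak size}."""
    peaks = {}
    for day, size in samples:
        peaks[day] = max(size, peaks.get(day, size))
    return peaks

def slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den if den else 0.0

def peaks_are_growing(samples, min_slope=0.0):
    """True when the daily peaks trend upward faster than min_slope per day."""
    peaks = sorted(daily_peaks(samples).items())
    return len(peaks) >= 2 and slope(peaks) > min_slope

# Invented data: two samples per day over three days.
samples = [(0, 100), (0, 300), (1, 120), (1, 400), (2, 90), (2, 520)]
print(peaks_are_growing(samples))
```

Fitting the peaks rather than the averages matters here, because the averages can stay flat while the spikes that threaten the session keep getting taller.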
What is the Source?
There were two primary VMAX disk storage systems (DSS) in the SRDF/A session. Were the spikes in data throughput coming from one of the VMAX systems or were they both contributing equally? Figure 3 below shows that the replication activity is coming primarily from one VMAX (EMC000002). In fact, the session dropped right after a big spike in writes from that DSS. We have found the culprit!
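The same question can be asked of raw numbers: what fraction of the replicated writes comes from each storage system? The sketch below totals write volume per DSS and returns each system's share. The record layout is an assumption, the figures are invented, and the serial EMC000001 is a hypothetical stand-in for the second VMAX, which is not named here.

```python
from collections import defaultdict

def write_share_by_dss(records):
    """records: iterable of (dss_serial, write_mb). Return {serial: fraction}."""
    totals = defaultdict(float)
    for serial, write_mb in records:
        totals[serial] += write_mb
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {serial: 0.0 for serial in totals}
    return {serial: total / grand_total for serial, total in totals.items()}

# Invented write volumes; EMC000001 is a hypothetical serial.
records = [("EMC000001", 100), ("EMC000002", 300), ("EMC000002", 100)]
for serial, share in sorted(write_share_by_dss(records).items()):
    print(f"{serial}: {share:.0%}")
```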
Why is this Happening?
We looked at the “Write Pending High-Water Mark” over time and found that whenever it exceeded 1.4 million slots, the session dropped. The large spike in write activity would consume too much cache on the primary DSS. Once it hit the threshold, the VMAX caused SRDF/A to suspend.
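A threshold like this lends itself to a simple check. The sketch below uses the 1.4 million slot level observed in this case and reports both the remaining headroom and the first sample that exceeded it. That number comes from this one environment and should not be read as a universal VMAX limit. The series layout and the 10% warning margin are assumptions for illustration.

```python
# Level at which the session dropped in this case; not a universal limit.
SUSPEND_THRESHOLD_SLOTS = 1_400_000

def headroom(hwm_slots, threshold=SUSPEND_THRESHOLD_SLOTS):
    """Fraction of the threshold still unused (negative once exceeded)."""
    return 1 - hwm_slots / threshold

def at_risk(hwm_slots, margin=0.1, threshold=SUSPEND_THRESHOLD_SLOTS):
    """True when less than `margin` of the threshold remains (hypothetical margin)."""
    return headroom(hwm_slots, threshold) < margin

def first_breach(series, threshold=SUSPEND_THRESHOLD_SLOTS):
    """series: iterable of (timestamp, hwm_slots). Return first timestamp above threshold."""
    for timestamp, slots in series:
        if slots > threshold:
            return timestamp
    return None

# Invented samples of the write pending high-water mark.
series = [("09:00", 900_000), ("09:15", 1_350_000), ("09:30", 1_450_000)]
print(first_breach(series))
print([ts for ts, slots in series if at_risk(slots)])
```

Alerting on shrinking headroom, rather than on the breach itself, gives a chance to act before the session suspends.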
What is the Solution?
One possible solution is implementing delta set expansion. This is an SRDF/A feature that puts replication data on disk if too many slots are consumed in cache. However, this may lengthen your RPO and perhaps is not the preferred solution.
A better fix is simply to rebalance the workload across the two VMAX DSSs. Using IntelliMagic Vision, we found that most of the imbalance was due to a single storage group. Changing the allocations of that storage group eliminated the suspends.
Is it Time to Improve Your Availability Intelligence?
If you would like the intelligence needed to avoid outages like this, consider IntelliMagic Vision. With IntelliMagic Vision you truly will be singing “Time is on my side, yes, it is”!
This article's author
Senior Storage Performance Consultant
Read Lee's other blogs and bio