Choosing the wrong V7000/SVC replication technology can put your entire availability strategy at risk.
For many customers, how replication works remains something of a mystery. On the surface, it is simple: data is written to a primary copy and copied, either synchronously or asynchronously, to a secondary location, with the expectation that a failure at the primary site will result in minimal data loss and minimal recovery effort.
There are several types of replication, and each type has its nuances. Each of these technologies should be evaluated in light of the following business requirements:
1. Recovery Point Objective (RPO): This is the amount of data, expressed in time units (typically minutes), that you will lose if you fail over to the secondary site.
- How much does it cost per minute of data that is lost? This should be estimated based on the types of data that might be lost, the amount of each type of data lost, and the probability of the loss.
2. Recovery Time Objective (RTO): This is the amount of time, expressed in minutes, that it takes to get back up and running.
- How much does it cost per minute for your business to be down? (See Toolkit: Downtime Cost Calculator for Data Center Disaster Recovery Planning for additional information)
3. How much does network bandwidth cost? This varies greatly by location.
4. What are the distances required between primary and secondary locations as prescribed by your Disaster Recovery plan? As an example, the SEC suggests 200 miles for financial institutions.
5. How much data do you need to replicate? First decide which data needs to be replicated. Then baseline the peak write throughput (expressed in Mbit/sec) that occurs within the selected replication sets.
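As a minimal sketch of the baselining step in item 5, the snippet below turns a series of measured write rates into a peak figure in Mbit/sec. The function name and the sample values are hypothetical, and it assumes your monitoring tool reports write throughput in MB/sec using decimal units, so the conversion is simply a factor of 8.

```python
def peak_write_mbit_per_sec(write_mbyte_per_sec_samples):
    """Return the peak write rate in Mbit/sec from samples given in MB/sec.

    Hypothetical helper: assumes decimal units (1 MB = 8 Mbit) and that the
    samples cover only the volumes in the selected replication sets.
    """
    if not write_mbyte_per_sec_samples:
        raise ValueError("at least one sample is required")
    return max(write_mbyte_per_sec_samples) * 8


# Illustrative samples only, not measurements from a real system.
samples = [310.0, 750.0, 420.5]
print(peak_write_mbit_per_sec(samples))
```

Sampling interval matters here: long averaging intervals smooth out short bursts, so a baseline built from coarse samples may understate the true peak.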
After developing your Disaster Recovery (DR) requirements, you need to decide which technology best maps to your requirements and your budget. Fortunately, SVC/V7000 offers several types of copy services:
- Metro Mirror (MM) is synchronous replication for metropolitan distances. Writes to both the primary and secondary disks are committed before the host write is acknowledged as complete. Use it only when the distance is short and the bandwidth is high, and understand that any congestion on the fabric directly impacts front-end write response time. The obvious advantage of synchronous mirroring is that your secondary site is always in sync with your primary site, so the Recovery Point Objective (RPO) is measured in seconds rather than minutes.
- Global Mirror (GM) uses asynchronous copy services and is better suited to low-bandwidth situations. Mild and occasional WAN congestion will not impact front-end write response times to the primary site copies. However, the secondary copy can be up to five minutes out of sync with the primary copy. This means that if the primary site fails, the secondary copy may not contain all of the changes that were made at the primary site.
- Global Mirror with Change Volumes (GMwCV) is essentially a continuous FlashCopy that asynchronously updates a remote copy. It completely isolates the primary from WAN issues, but it consumes significant disk capacity and cache resources locally, and the remote copy can be an hour or more out of sync with your local copy, depending on how you configure the cycle time.
- Stretched Cluster Volume Mirroring: This can also be considered a replication option within the SVC/V7000 families. In this case, an SVC/V7000 cluster has nodes at two locations, allowing real-time copies to be spread across both. Both the RPO and the RTO of Stretched Cluster Volume Mirroring are near zero. The only drawback is that, for this to be an effective solution, you need high-speed fibre optic connectivity and a distance of less than 100 km.
The following table summarizes the criteria that should be used when deciding the appropriate replication technology:
After selecting your replication technology, you need to determine the bandwidth requirements. IntelliMagic Vision converts your write data rate into network Mbit/sec to help you baseline your peak bandwidth, as demonstrated in the figure below:
The chart above shows that we need a minimum of 6,000 Mbit/sec to handle the peak workload. You should also factor in some amount of growth. Furthermore, depending on whether you select a synchronous or an asynchronous solution, you can decide whether you need to plan for an additional buffer or whether you can get away with purchasing a little less than your maximum peak bandwidth. You should also add sufficient bandwidth for the synchronization activity that will happen from time to time. If your maximum synchronization bandwidth is 50 MBps, then you should plan for 400 Mbit/sec (50 MBps × 8) of additional bandwidth. Thus, your total Bandwidth Required could be expressed as:
Bandwidth Required = Baseline Peak Mbit/sec + Anticipated New Growth Mbit/sec + Buffer for Unexpected Peaks (20% for MM) + Synch Bandwidth
Using the example above:
MM Bandwidth required = 6,000 + 1,200 (growth) + 1,200 (buffer) + 400 (50 MBps × 8) (Synch bandwidth) = 8,800 Mbit/sec. If compression was not taken into account you would need to divide by the appropriate compression ratio.
GM Bandwidth required = 6,000 + 1,200 (growth) + 400 (Synch bandwidth) = 7,600 Mbit/sec. If compression was not taken into account you would need to divide by the appropriate compression ratio.
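The formula and the two worked examples above can be sketched in a few lines of Python. The function and parameter names are hypothetical. The 20% growth figure is an assumption inferred from the 1,200 Mbit/sec used in the example, and the compression ratio defaults to 1 (no adjustment).

```python
def required_bandwidth_mbit(baseline_peak_mbit, growth_pct, buffer_pct,
                            sync_mbyte_per_sec, compression_ratio=1.0):
    """Estimate required replication bandwidth in Mbit/sec.

    Implements: baseline peak + growth + buffer + synchronization bandwidth,
    optionally divided by a compression ratio. All names are illustrative.
    """
    growth = baseline_peak_mbit * growth_pct / 100   # anticipated new growth
    buffer = baseline_peak_mbit * buffer_pct / 100   # headroom for unexpected peaks
    sync = sync_mbyte_per_sec * 8                    # MB/sec -> Mbit/sec
    return (baseline_peak_mbit + growth + buffer + sync) / compression_ratio


# Metro Mirror: 20% buffer for unexpected peaks.
mm = required_bandwidth_mbit(6000, growth_pct=20, buffer_pct=20, sync_mbyte_per_sec=50)
# Global Mirror: no buffer in the example above.
gm = required_bandwidth_mbit(6000, growth_pct=20, buffer_pct=0, sync_mbyte_per_sec=50)
print(mm, gm)  # 8800.0 7600.0
```

Keeping growth and buffer as separate percentages makes it easy to rerun the estimate with your own planning assumptions instead of the figures used here.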
Note: For additional information, please see the IBM Redbook, IBM System Storage SAN Volume Controller and Storwize V7000 Replication Family Services.
What replication technology did you choose? In the next installment of this blog, we will discuss performance monitoring and analysis of SVC/V7000 replication technologies.