Have you ever wondered about the impact of zero RPO on Mainframe Virtual Tape for business continuity or disaster recovery? This blog focuses on the impact on jobs that use the Oracle/STK VSM Enhanced Synchronous Replication capability while delivering an RPO of 0.
A recovery point objective, or “RPO”, is defined by business continuity planning. It is the maximum targeted time period in which data might be lost from an IT service due to a major incident.
Brief History of Mainframe Offsite Tape Storage
Years ago, probably before some of you were born, IT organizations would back up all their disk volumes to tape, pack them in boxes and use “PTAM” (Pickup Truck Access Method) to send them to their offsite storage facility to await a disaster recovery test. The only variation in this procedure between organizations was whether Ford or Chevy trucks were used. IT managers knew exactly which tape volumes were offsite and when they went offsite. There was no guessing as to when the tapes went offsite – they saw the truck drive away. The tape operators had a list of volsers that were stuffed into those boxes. Security was a padlock on the tape boxes, which sometimes needed to be cut with a bolt cutter (firsthand knowledge on that one).
The best RPO with the truck method was 49 hours and more likely to be 73 hours. That is not very good by current business requirements.
Virtual Tape Impacts
Virtual tape systems have changed the game. RPO has decreased, but at any point in time, most IT managers have less of an idea of which tape volumes are offsite than they did with the truck method. Each virtual tape vendor has multiple methods to move the virtual volumes offsite for disaster recovery. All can electronically send virtual tape volumes over a network to another site. Each vendor has their reporting mechanisms to show how much data is offsite and how much is left to send offsite. Tape volumes are created all throughout the night and day, making it difficult to detect when an RPO has been exceeded.
The Oracle/STK Virtual Storage Manager (VSM) is composed of control software and one or more Virtual Tape Storage Systems (VTSSs). The VTSSs are also referred to as VSMs, especially when referring to a specific model type such as VSM5, VSM6 or VSM7.
A Tapeplex is a single hardware configuration normally represented by a single Control Data Set (CDS).
A VTSS can be connected to other VTSSs via 1 Gb or 10 Gb Ethernet links.
Replication copies Virtual Tape Volumes (VTVs) between VTSSs. Replication can occur while the VTV is mounted (synchronous) or after the VTV has been dismounted (asynchronous).
A cluster is created when two VTSSs within the same Tapeplex are connected by these links.
Cross Tapeplex Replication (CTR) can replicate VTVs between two Tapeplexes.
Oracle/STK VSM Replication Modes
The following Replication Modes are available:
- Asynchronous Replication – the tape volume is replicated after the Rewind/Unload command and after the volume has been dismounted. The host job will not wait for the volume to be replicated.
- Synchronous Replication (legacy) – the VTV is replicated after the Rewind/Unload command is issued by the host job but before it is completed by the VTSS and before the volume has been dismounted. This impacts the host job because it is delayed until the data has been copied to the remote site. This job elongation is why the legacy mode was never very popular or widely used.
- Enhanced Synchronous Replication – recently became available and replaces the legacy Synchronous Replication. This replication allows the VTSS to begin copying data to the target VTSS as it is being written by the host job. The host job is not impacted until it issues the Rewind/Unload. It is then delayed until the VTSS completes the copy to the target VTSS. This guarantees that all data is in both VTSSs when the Rewind/Unload completes.
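The difference between the three modes can be sketched as a rough timing model. This is only an illustration under stated assumptions, not how the VTSS is implemented: it assumes the copy is limited by a single link rate, that the enhanced copy starts when the job starts writing, and it ignores compression and protocol overhead. The function name and parameters are hypothetical.

```python
def host_delay_seconds(mode, vtv_mb, link_mb_per_s, write_seconds):
    """Estimate how long Rewind/Unload holds the host job (simplified model).

    mode          -- "async", "sync_legacy" or "sync_enhanced" (hypothetical labels)
    vtv_mb        -- amount of data written to the VTV, in MB
    link_mb_per_s -- assumed replication throughput, in MB/second
    write_seconds -- how long the host job spent writing the VTV
    """
    full_copy = vtv_mb / link_mb_per_s
    if mode == "async":
        # The copy happens after dismount, so the job does not wait.
        return 0.0
    if mode == "sync_legacy":
        # The whole VTV is copied once Rewind/Unload is issued.
        return full_copy
    if mode == "sync_enhanced":
        # The copy overlaps the write; only the unsent remainder is left.
        return max(0.0, full_copy - write_seconds)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("async", "sync_legacy", "sync_enhanced"):
        delay = host_delay_seconds(mode, vtv_mb=32_000, link_mb_per_s=100,
                                   write_seconds=300)
        print(f"{mode}: {delay:.0f} s")
```

In this model the enhanced mode only delays the job when the link cannot keep up with the rate at which the job writes, which is one plausible reason larger VTVs would see longer delays.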
SMF analysis is needed for all replication modes. For Asynchronous Replication, it is needed to determine if the RPO has been exceeded. For Synchronous Replication, it is needed to determine how long the host jobs have been delayed while the VTV is replicated.
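For the asynchronous case, the check amounts to comparing each volume's dismount time with the time its copy reached the remote site. The sketch below assumes those two timestamps have already been extracted from the SMF data; the tuple layout, function name and volsers are made up for illustration and do not reflect the actual SMF record format.

```python
from datetime import datetime, timedelta


def vtvs_exceeding_rpo(records, rpo, now):
    """Return (volser, exposure) for VTVs left unreplicated longer than rpo.

    records -- list of (volser, dismounted_at, replicated_at) tuples;
               replicated_at is None while the copy is still outstanding
    rpo     -- allowed exposure as a timedelta
    now     -- reference time used for volumes not yet replicated
    """
    late = []
    for volser, dismounted_at, replicated_at in records:
        end = replicated_at if replicated_at is not None else now
        exposure = end - dismounted_at
        if exposure > rpo:
            late.append((volser, exposure))
    return late


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 2, 0)
    sample = [
        ("V00001", start, start + timedelta(minutes=5)),
        ("V00002", start, start + timedelta(minutes=40)),
        ("V00003", start, None),  # still waiting to be replicated
    ]
    check_time = start + timedelta(hours=1)
    for volser, exposure in vtvs_exceeding_rpo(sample, timedelta(minutes=15), check_time):
        print(volser, exposure)
```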
Oracle/STK VSM Enhanced Synchronous Replication Performance
The following charts are from a configuration utilizing enhanced synchronous replication. IntelliMagic Vision for Virtual Tape can report the job elongation time as 15-minute interval averages. Below is an IntelliMagic Vision chart showing job elongation time. The Average Sync Replication Time (sec) is the number of seconds from Rewind/Unload until the replication is complete. The legend shows the 8 sending VTSSs.
The overall weighted average delay time is 4 seconds. For most intervals, the delay is less than 10 seconds, although some intervals average 45 seconds or more. By using the drill-down option to separate the workload by VTV size, IntelliMagic Vision for Virtual Tape shows why some intervals are doing fine while others have more job elongation time.
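A weighted average matters here because a few slow intervals with little activity should not count as much as busy intervals with short delays. As a minimal sketch, assuming each interval is weighted by its number of replications (the weighting actually used by the product is not stated here, and the sample numbers are invented):

```python
def weighted_average_delay(intervals):
    """Average delay across intervals, weighted by replication count.

    intervals -- list of (replication_count, average_delay_seconds) tuples
    """
    total = sum(count for count, _ in intervals)
    if total == 0:
        return 0.0
    return sum(count * avg for count, avg in intervals) / total


if __name__ == "__main__":
    # Invented sample: one busy interval with short delays, one quiet slow one.
    sample = [(90, 2.0), (10, 45.0)]
    simple_mean = sum(avg for _, avg in sample) / len(sample)
    print(f"simple mean of interval averages: {simple_mean:.1f} s")
    print(f"weighted average: {weighted_average_delay(sample):.1f} s")
```

With this sample the simple mean of the two interval averages is 23.5 seconds, while the weighted average is 6.3 seconds, which is closer to what a typical job experienced.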
Oracle/STK VSM Enhanced Synchronous Replication Separation by VTV Size
This drilldown chart shows a marked difference in the job delay time between 4GB and 32GB VTVs.
Even though the 32GB VTVs are delayed much longer than the 4GB VTVs, the delay is still much shorter than it would have been with the legacy synchronous replication method. With legacy replication, the delay for a full 32GB VTV would have been 32,000 MB divided by 100 MB/second, or 320 seconds. The next chart isolates the 4GB volumes to get a better picture of the delay time.
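The same legacy baseline can be worked out for both VTV sizes. This small sketch reuses the 100 MB/second figure from the calculation above and assumes each VTV is completely full; the function name is hypothetical.

```python
LINK_MB_PER_S = 100  # throughput figure used in the text


def legacy_sync_delay_seconds(vtv_mb, link_mb_per_s=LINK_MB_PER_S):
    """Legacy synchronous delay: the whole VTV is copied at Rewind/Unload."""
    return vtv_mb / link_mb_per_s


if __name__ == "__main__":
    for label, size_mb in (("4GB", 4_000), ("32GB", 32_000)):
        print(f"{label}: {legacy_sync_delay_seconds(size_mb):.0f} s")
```

Under these assumptions a full 4GB VTV would have waited 40 seconds and a full 32GB VTV 320 seconds with the legacy method.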
The 4GB virtual volumes are averaging a job delay of only 2.5 seconds. The next chart shows the 32GB volumes isolated, and they are averaging much longer delays.
Measuring a Zero RPO Strategy
Deep analysis of the Oracle/STK VSM SMF records can be performed using IntelliMagic Vision for Virtual Tape. IntelliMagic Vision interprets the SMF data from the mainframe using built-in intelligence and ratings about the specific hardware and workloads, and puts them into a database for detailed and historical trending reporting in easy-to-use graphical views.