Dave Heggen - 8 June 2018

Our story begins as our stories usually do, somewhere in the middle after the customer has been working on a problem, in this case low throughput to virtual tape, for a while and they are just about to give-up.

The customer had a group of tape jobs that frequently would not finish on time. On time means the jobs complete within the batch window. Not on time means they would run beyond the batch window and would compete with the online activity. Some days the job would run in less than an hour, while the same job on other days would run for 10 to 15 hours.

Low Throughput to VSM Tape Systems

The customer investigated the issue when the jobs ran long and found that the throughput to the VSM tape systems was low for the jobs in question. A joint investigation with Oracle was started under the assumption that the problem was caused by the VSM tape systems. My intel tells me that 6 people from the customer worked on the investigation for a whole month and Oracle was unable to find anything wrong from a hardware or software perspective.

The investigation was temporarily abandoned, and the issue was left open with the perspective that the customer would eventually be getting new hardware, possibly from a different vendor. Shortly after the investigation was temporarily abandoned, the customer began a trial run of IntelliMagic Vision.

IntelliMagic personnel were onsite for this trial helping install IntelliMagic Vision and setting up data collection. The customer has a complex environment with 4 sysplexes (12 LPARS) and over 1 petabyte of primary online disk storage. During a live customer demonstration of the IntelliMagic Vision product, the customer explained the issue in detail, and we did a joint investigation during the meeting, comparing the application tape profile between various days.

Knowing the Unknowable

We showed the mounts that each job would generate, and the customer’s conclusion was that the application should never require many tape mounts. We concluded that IntelliMagic Vision did not yet provide a conclusive answer, but did point towards issues in the application, not in the Oracle VSM.

At night, after the meetings, we worked on product changes that would give conclusive answers to the issue at hand, such as improving the reporting of z/OS SMF tape data (adding some columns to a tabular report). The next morning, we introduced the product changes in the POC environment and reprocessed the data. We had a follow-on presentation with another group from the customer and showed them why the jobs were behaving differently on different days.

Using IntelliMagic Vision, it was extremely easy to see the behavior of the ill performing jobs, as far as tape activity was concerned. Based on the IntelliMagic Vision results, the customer knew exactly who had to fix the issue, and they turned to their application developers for design improvements.

Well Behaved Application Activity

Figure 1: Well Behaved Application Activity


Extreme Application Behavior

Figure 2: Extreme Application Behavior


Application Design Bug Led to Low Throughput

The problem was that on good days, each job would write 10 virtual tapes. On bad days, each job would write 150 virtual tapes. On top of that problem, the throughput to the VSM tape systems was low, not because of a problem in the VSM tape systems, but due to excessive processing in the application, causing low throughput to the virtual tapes.

Conclusion: This problem started with the assumption that the issue was with the hardware, but it turned out to be a bug in the application design. Having accurate measurements to describe the behavior of the hardware and the contribution of each job during execution provides a complete picture of the environment and can help to eliminate time and expense spent trying to fix the wrong problem.



How to Find Sick But Not Dead (SBND) TS7700 Tape Clusters

Rather than waiting for a remote VTS to fail, you should be reviewing these key TS7700 reports to determine if remote clusters are receiving replication data or not.

Read more

Are My Remote Clusters Receiving Replication?

Review these key reports when troubleshooting remote cluster issues or trying to determine if your remote clusters are receiving replication.

Watch video
Customer Experience

Colruyt Group IT opts for interactive mainframe analysis with IntelliMagic's expert knowledge

Colruyt Group IT needed out of the box reporting with built-in knowledge in the field of performance and capacity.

Read more

Go to Resources

This article's author

Dave Heggen
Storage Performance Consultant
Read Dave's bio

Share this blog

Subscribe to our Newsletter

Subscribe to our newsletter and receive monthly updates about the latest industry news and high quality content, like webinars, blogs, white papers, and more.