Dave Heggen - 8 June 2018

Our story begins as our stories usually do, somewhere in the middle after the customer has been working on a problem, in this case low throughput to virtual tape, for a while and they are just about to give-up.

The customer had a group of tape jobs that frequently would not finish on time. On time means the jobs complete within the batch window. Not on time means they would run beyond the batch window and would compete with the online activity. Some days the job would run in less than an hour, while the same job on other days would run for 10 to 15 hours.

Low Throughput to VSM Tape Systems

The customer investigated the issue when the jobs ran long and found that the throughput to the VSM tape systems was low for the jobs in question. A joint investigation with Oracle was started under the assumption that the problem was caused by the VSM tape systems. My intel tells me that 6 people from the customer worked on the investigation for a whole month and Oracle was unable to find anything wrong from a hardware or software perspective.

The investigation was temporarily abandoned, and the issue was left open with the perspective that the customer would eventually be getting new hardware, possibly from a different vendor. Shortly after the investigation was temporarily abandoned, the customer began a trial run of IntelliMagic Vision.

IntelliMagic personnel were onsite for this trial helping install IntelliMagic Vision and setting up data collection. The customer has a complex environment with 4 sysplexes (12 LPARS) and over 1 petabyte of primary online disk storage. During a live customer demonstration of the IntelliMagic Vision product, the customer explained the issue in detail, and we did a joint investigation during the meeting, comparing the application tape profile between various days.

Knowing the Unknowable

We showed the mounts that each job would generate, and the customer’s conclusion was that the application should never require many tape mounts. We concluded that IntelliMagic Vision did not yet provide a conclusive answer, but did point towards issues in the application, not in the Oracle VSM.

At night, after the meetings, we worked on product changes that would give conclusive answers to the issue at hand, such as improving the reporting of z/OS SMF tape data (adding some columns to a tabular report). The next morning, we introduced the product changes in the POC environment and reprocessed the data. We had a follow-on presentation with another group from the customer and showed them why the jobs were behaving differently on different days.

Using IntelliMagic Vision, it was extremely easy to see the behavior of the ill performing jobs, as far as tape activity was concerned. Based on the IntelliMagic Vision results, the customer knew exactly who had to fix the issue, and they turned to their application developers for design improvements.

Well Behaved Application Activity
Figure 1: Well Behaved Application Activity


Extreme Application Behavior
Figure 2: Extreme Application Behavior


Application Design Bug Led to Low Throughput

The problem was that on good days, each job would write 10 virtual tapes. On bad days, each job would write 150 virtual tapes. On top of that problem, the throughput to the VSM tape systems was low, not because of a problem in the VSM tape systems, but due to excessive processing in the application, causing low throughput to the virtual tapes.

Conclusion: This problem started with the assumption that the issue was with the hardware, but it turned out to be a bug in the application design. Having accurate measurements to describe the behavior of the hardware and the contribution of each job during execution provides a complete picture of the environment and can help to eliminate time and expense spent trying to fix the wrong problem.



6 Reports Every IBM TS7700 Performance Analyst Should Have

Rather than reviewing every indicator across your entire tape grid, there are 6 key reports that you should be reviewing regularly to keep your TS7700 in good working order.

Read more

IBM z15 Announcement Highlights and How to Take Advantage

The z15 (with a General Availability date of 9/23/2019) offers up to 190 CPU cores (vs. 170 on z14) and 40 TB of usable memory (vs. 32 on z14), in addition to processor cache and overall performance improvements.

Read more

CPU – Just the Tip of the z/OS Iceberg

z/OS is now responsible for everything from online transactions, network traffic, replicated data, and so much more. The infrastructure to handle this complexity is no longer limited to ensuring CPU is going to be okay.

Read more

Go to Resources

This article's author

Dave Heggen
Storage Performance Consultant
Read Dave's bio

Share this blog

Subscribe to our Newsletter

Subscribe to our newsletter and receive monthly updates about the latest industry news and high quality content, like webinars, blogs, white papers, and more.