Enterprise Level Storage Systems Best Practices

Stuart Plotkin - 6 February 2017

The journey to create best practices for enterprise level storage systems

For many years, I have had the opportunity to be involved in addressing enterprise infrastructure performance challenges. When I mention to friends that I am a storage performance expert, they occasionally respond: “Me too, I have this huge hard drive on my laptop and I make sure that the IO blazes.”

It is then that I smile and calmly explain that, on your laptop, you are the only user. In my world, I have hundreds to thousands of people all wanting to use hard drives at the same time. In these cases, the hard drive performance is mission critical and needs to be in top condition 24/7.

It is then that they start to get the picture.

Confessions of a Storage Performance Analyst

I cannot overemphasize the need for proactive performance engineering for enterprise storage systems. In the first phase of a company’s use of a shared storage infrastructure, users and applications are added and the technology typically hums along incredibly fast. It may take a while for a company to fill up a storage system, but once it does, it is too late to avoid performance issues.

When storage capacity is reaching its limit, slow response times and related issues become the norm. The best way to maximize the efficiency of the shared resources that make up a Storage Area Network (SAN) is to begin performance engineering from the get-go.

Users of distributed and open systems are used to sharing servers, networks and devices, but proactive performance engineering is often another matter. This is illogical at best: a storage device serving hundreds of production applications calls for an effort comparable to performance engineering on a mainframe. Workload balancing and configuration tuning are critical to keeping the system running efficiently.

Establishing Industry Best Practices for Storage Performance Management

When I was asked to take on a project to establish performance best practices for a set of robust enterprise level storage systems, I knew that the undertaking would require me to:

Identify key performance metrics
Trend the data across all storage systems
Create metric thresholds
Set up proactive automated notifications

What I didn’t realize, at least at the time, was that these steps would put me on the critical path to establishing industry best practices. I also didn’t know that the task would be close to impossible, given the then-current state of storage vendor performance management offerings. Nor did I know that IntelliMagic Vision offered all of this and more in a user-friendly, heterogeneous package.

Had I known the strengths and flexibility of the IntelliMagic software, I could have saved considerable time and money. And, I would have been able to deliver to the application teams a level of visibility and productivity that would have greatly increased the value and performance of their mission critical applications.

Best Practice #1 Identify key performance metrics

As I began my journey, I realized that dozens of metrics were available to me. Some even came labeled as “critical.” However, I also discovered, as problems emerged, that these supposedly “critical” metrics would not predict all of the problems we were seeing.

I realized I needed to develop my own short list of key metrics. For example, storage systems defer the actual writes to disk to save IO time, temporarily holding the data in cache. However, when the cache becomes too full, the pending writes must be written out from cache to disk, which can slow read IO. Hence, the metric that best captures this condition needs to be selected and monitored.
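To make the cache example concrete, here is a minimal sketch of how such a metric might be watched. The field names, the sample layout and the 70% limit are assumptions for illustration only; they are not taken from any vendor's tooling, and the right limit depends on the storage system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheSample:
    """One collection interval for a storage system (all fields are hypothetical)."""
    interval: str              # label for the interval, e.g. "10:15"
    write_pending_pct: float   # share of cache holding writes not yet on disk
    read_response_ms: float    # average read response time in the interval


def intervals_under_pressure(samples: List[CacheSample],
                             pending_limit_pct: float = 70.0) -> List[CacheSample]:
    """Return the intervals where pending writes filled the cache past the limit.

    The 70% default is an assumed placeholder, not a recommended value.
    """
    return [s for s in samples if s.write_pending_pct >= pending_limit_pct]


def average_read_ms(samples: List[CacheSample]) -> float:
    """Average read response time over the given intervals (0.0 if there are none)."""
    if not samples:
        return 0.0
    return sum(s.read_response_ms for s in samples) / len(samples)
```

Comparing the average read response time of the flagged intervals with that of all intervals is one simple way to check whether a candidate metric actually tracks the slowdown users are seeing.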

Best Practice #2 Trend the data across all storage systems

Once I had identified the key metrics I felt were most meaningful, my next challenge was to collect, store, trend and analyze them – especially across multiple storage systems and locations.

Luckily, I was able to utilize several off-shore, low-cost resources and put them to work on an intensely manual effort. Unfortunately, there was no API available with which to design and implement an automated solution.
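However the numbers are gathered, the trending step itself can be sketched in a few lines. The example below assumes one metric value per day per storage system, held in a plain dictionary; the system names and the data layout are hypothetical. It fits a least-squares line to each series and ranks the systems by how fast the metric is growing.

```python
from typing import Dict, List, Tuple


def linear_slope(values: List[float]) -> float:
    """Least-squares slope of values against their index (change per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def rank_by_growth(history: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (system, slope) pairs, fastest-growing metric first."""
    slopes = [(name, linear_slope(series)) for name, series in history.items()]
    return sorted(slopes, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical daily utilization percentages for two storage systems.
    history = {
        "array_a": [10.0, 12.0, 14.0, 16.0],
        "array_b": [50.0, 50.0, 50.0, 50.0],
    }
    for name, slope in rank_by_growth(history):
        print(name, slope)
```

A straight-line fit is the simplest possible trend model. It is enough to show which systems deserve attention first, but it will miss seasonal patterns such as month-end peaks.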

Best Practice #3 Create metric thresholds

Defining thresholds required answers to a series of questions:

What level of metric utilization would be considered good?
What level would be bad?
When would we be hitting the knee in the curve?

Getting to the answers led to research and testing, as well as seeking out the advice of many peers. Progress was very slow.
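One common way to reason about the knee in the curve is a simple queueing model. The sketch below assumes a single-server M/M/1 queue, in which response time equals service time divided by (1 - utilization). Real storage components only approximate this model, so the numbers are illustrative and not thresholds to copy.

```python
def stretch_factor(utilization: float) -> float:
    """Response time as a multiple of service time in an M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return 1.0 / (1.0 - utilization)


def utilization_for_stretch(max_stretch: float) -> float:
    """Highest utilization that keeps response time within max_stretch times service time."""
    if max_stretch < 1.0:
        raise ValueError("max_stretch must be at least 1")
    return 1.0 - 1.0 / max_stretch


if __name__ == "__main__":
    # Show how quickly response time grows as utilization approaches 100%.
    for pct in (50, 70, 80, 90, 95):
        print(pct, round(stretch_factor(pct / 100), 1))
```

Under this model, response time is twice the service time at 50% utilization, five times at 80% and ten times at 90%. The curve is fairly flat at low utilization and steepens sharply afterward, which is why the useful threshold depends on how much response time stretch a given workload can tolerate.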

While I was attending a storage vendor conference, a very senior member of a storage company suggested running all metrics at 50% to ensure top performance. To me, however, that was comparable to telling the Euro Railways to always run with only half the passengers they could accommodate, to guarantee that the trains would get to where they needed to be on time.

While they might get there fast, the company could also quickly go out of business.

Best Practice #4 Set up proactive automated notifications

While we had expended a lot of resources to develop our own key metrics and manually collect the data, we found that threshold data simply wasn’t available. Having hit this wall, we were not in a position to set up proactive notifications. For example, we wanted to be notified when a front-end storage system port became flooded with IO from one host, impacting all other users of the shared resource. With proper proactive notification, this condition could be detected and the workload re-balanced.
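The port example can be expressed as a simple rule. The sketch below is illustrative only: the per-host IOPS dictionary, the port limit and both percentage thresholds are assumptions, and a real check would use whatever port and host metrics the storage system actually exposes.

```python
from typing import Dict, Optional


def dominant_host_alert(host_iops: Dict[str, float],
                        port_limit_iops: float,
                        busy_pct: float = 80.0,
                        share_pct: float = 60.0) -> Optional[str]:
    """Return the host flooding a front-end port, or None if there is no such host.

    A host is reported only when the port is busy (total IOPS at or above
    busy_pct of its limit) and that single host contributes at least
    share_pct of the port's traffic. Both defaults are assumed placeholders.
    """
    total = sum(host_iops.values())
    if total == 0:
        return None
    if 100.0 * total / port_limit_iops < busy_pct:
        return None  # port is not busy enough to matter
    host, iops = max(host_iops.items(), key=lambda item: item[1])
    if 100.0 * iops / total < share_pct:
        return None  # load is high but spread across hosts
    return host
```

Requiring both conditions keeps the notification quiet when a single host dominates an otherwise idle port, and when a port is busy but evenly shared. Neither of those cases calls for re-balancing one host's workload.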

But, despite the limitations, the project ultimately was considered a success, primarily because we were able to achieve our goal to develop best practices.

However, if I’d had the IntelliMagic software available to me, I could have saved my company considerable time and expense, and delivered the results at a significantly higher level of competency.

Maintaining High-Performing Systems is a Business Imperative

Out of the box, IntelliMagic Vision would have provided us the key metrics we needed, as well as proactive dashboards, dynamic thresholds, automated notifications and more.

As storage environments continue to grow and become more complex, maintaining stable, high-performing systems is a business imperative.

The years of solid expert knowledge and experience that have been incorporated into IntelliMagic’s software solutions will keep you ahead of the game and arm you with the confidence to run your competitive, business critical applications on your storage infrastructure.
