Brett Allison - 7 February 2020

Encoding errors – are they distracting noise or important information that shouldn’t be ignored? There’s so much noise in a big fabric that it can be hard to know what should be investigated. In this blog we will discuss the meaning of a few of the common errors we see and discuss the risk associated with ignoring them as well as the best way to address them.

Encoding Errors in Your SAN Fabric

Encoding errors are low-level errors that indicate an encoding disparity inside frames. They occur at the Fibre Channel FC-1 layer, which encodes 8 bits into 10 bits and back or, for 10G and 16G FC, 64 bits into 66 bits and back.

Because these errors occur on bits that are part of a data frame, they are counted under the Encoding Errors metric. If the Cyclic Redundancy Check (CRC) errors increase as well, then there is likely a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or replacing the small form-factor pluggable (SFP) and/or Host Bus Adapter (HBA). If there are no CRC errors, then it is likely an issue with the cable only.

Figure 1 below shows the top 20 ports with Encoding Errors. The offending port, slot4 port6, is highlighted in the legend in bold.

Figure 1: Encoding Errors (top 20)
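
A ranking like the one in Figure 1 can be approximated from two samples of the per-port error counters. The following is a minimal sketch, not the tool's actual method: it assumes the counters are cumulative, and the function name `top_ports_by_errors` and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def top_ports_by_errors(
    before: Dict[str, int], after: Dict[str, int], n: int = 20
) -> List[Tuple[str, int]]:
    """Rank ports by the growth of their error counter between two samples."""
    deltas = {}
    for port, count in after.items():
        # Assumption: counters are cumulative. Clamp at zero so a counter
        # reset between samples does not show up as a negative delta.
        delta = max(count - before.get(port, 0), 0)
        if delta > 0:
            deltas[port] = delta
    # Largest delta first; port name breaks ties so the order is stable.
    return sorted(deltas.items(), key=lambda item: (-item[1], item[0]))[:n]


# Invented sample values, for illustration only.
earlier = {"slot4 port6": 10, "slot1 port2": 5, "slot2 port0": 7}
later = {"slot4 port6": 510, "slot1 port2": 8, "slot2 port0": 7}
for port, delta in top_ports_by_errors(earlier, later):
    print(port, delta)
```

Ranking by the delta rather than the raw counter keeps ports with old, already-resolved errors out of the top of the list.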

 

In this case, we received an alert about the encoding errors as well as the CRC errors. While looking at the chart of the errors over time as shown in Figure 2, we noticed the same pattern.

Figure 2: Cyclic Redundancy Check (CRC) Errors

Figure 2 shows the number of CRC errors. Encoding errors might lead to a CRC error; however, this metric shows frames that have been marked as invalid because of a CRC error earlier in the datapath.

According to the FC specifications, it is up to the implementer whether to discard the frame right away or to mark it as invalid and send it on to the destination anyway. There are pros and cons to both approaches.

Essentially, if you see CRC errors, it means the port has received a frame with an incorrect CRC, but the corruption occurred further upstream. If the Encoding Errors increase as well, there is a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or (in rare cases) replacing the SFP and/or HBA.
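
The way these two counters combine can be summarized as a small decision function. This is only a sketch of the rules of thumb described in this blog, not a vendor algorithm; the names `Diagnosis` and `diagnose_port` are hypothetical, and the inputs are assumed to be the increase in each counter over the same time window.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "no new errors"
    CABLE = "likely an issue with the cable only"
    PHYSICAL_LINK = "physical link issue: clean connectors, replace cable, or replace SFP/HBA"
    UPSTREAM = "frame was corrupted further upstream"


def diagnose_port(encoding_delta: int, crc_delta: int) -> Diagnosis:
    """Map the growth of both counters on one port to a likely cause."""
    if encoding_delta > 0 and crc_delta > 0:
        # Both counters rising together points at this port's physical link.
        return Diagnosis.PHYSICAL_LINK
    if encoding_delta > 0:
        # Encoding errors without CRC errors: likely the cable only.
        return Diagnosis.CABLE
    if crc_delta > 0:
        # CRC errors alone: the frame was already bad when it arrived.
        return Diagnosis.UPSTREAM
    return Diagnosis.HEALTHY


print(diagnose_port(encoding_delta=120, crc_delta=45).value)
```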

Fibre Channel design best practice stipulates that each host maintains at least two paths to the data so that redundancy is preserved at all times. If one path fails, the redundancy is gone and the host is vulnerable to losing its connection to the data.
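
A redundancy check along these lines can be sketched as follows. The data layout (host name mapped to path name mapped to a healthy flag), the function name `hosts_at_risk`, and the sample hosts are assumptions made for illustration; how path health is actually determined depends on your monitoring tooling.

```python
from typing import Dict, List


def hosts_at_risk(paths: Dict[str, Dict[str, bool]], minimum: int = 2) -> List[str]:
    """Return the hosts that have fewer than `minimum` healthy paths."""
    at_risk = []
    for host, host_paths in paths.items():
        # Count only the paths currently flagged as healthy.
        healthy = sum(1 for ok in host_paths.values() if ok)
        if healthy < minimum:
            at_risk.append(host)
    return sorted(at_risk)


# Hypothetical inventory: hostB has lost one of its two paths.
inventory = {
    "hostA": {"fabricA": True, "fabricB": True},
    "hostB": {"fabricA": False, "fabricB": True},
}
print(hosts_at_risk(inventory))
```

A host flagged here is still online, but a single further failure would cut it off from its data.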

In this case, we started to receive these errors along with a few other errors, and we alerted the customer to check the cable and SFP. Between 10:00 AM on 11/11/2019 and 11:00 AM on 11/12/2019, while the host initiator port was throwing these errors, the host had only a single path to the fabric. Fortunately, the other path was functional, but during this time period the availability risk of this server was high.

On 11/12 the cable connection was inspected in the data center and was found to be loose. Someone may have been working on the fabric and bumped the cable. It was reconnected tightly at around 11:00 AM, after which the errors stopped and datapath availability was restored.

Avoid Risks in your Fabric Environment

In this blog we looked at some of the key indicators that your fabric is behaving well. It is essential that you monitor your fabric and alert when there are errors, as the errors are often a warning that something is going to fail soon or an alert that something has already failed.

It is important to understand the errors so you can treat them in accordance with their severity and impact. In order to effectively do this, you must collect the right data and set up the appropriate alerting mechanisms.

Are you monitoring your fabric for these types of issues? Do you have a way to filter out the false positives? Do you understand what all the errors signify, and which ones can be ignored? If you would like a review of the health of your SAN fabric, or would like to engage IntelliMagic for proactive monitoring of your fabric, please let us know how we can help by sending an email to info@intellimagic.com.

This article's author

Brett Allison
Director of Technical Services