Encoding errors – are they distracting noise or important information that shouldn’t be ignored? There’s so much noise in a big fabric that it can be hard to know what should be investigated. In this blog we will discuss the meaning of a few of the common errors we see and discuss the risk associated with ignoring them as well as the best way to address them.
Encoding Errors in Your SAN Fabric
Encoding errors are low-level errors that indicate an encoding disparity inside frames. They occur at the FC-1 layer of the Fibre Channel standard, which encodes 8 bits into 10 and back (8b/10b) or, with 10G and 16G FC, 64 bits into 66 and back (64b/66b).
Because these errors occur on bits that are part of a data frame, they are counted in the Encoding Errors column. If the Cyclic Redundancy Check (CRC) errors increase as well, then there is likely a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or replacing the small form-factor pluggable (SFP) transceiver and/or Host Bus Adapter (HBA). If there are no CRC errors, then the issue is likely with the cable only.
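To put the two encodings side by side, here is a small illustrative calculation of how much of the line rate each one spends on encoding. The function and dictionary names are our own, chosen for this example only; the ratios follow directly from the 8-to-10 and 64-to-66 bit mappings described above (80% and roughly 97% efficiency respectively).

```python
from fractions import Fraction

# Payload bits and transmitted (line) bits per code group for each encoding.
ENCODINGS = {
    "8b/10b": (8, 10),
    "64b/66b": (64, 66),
}

def efficiency(payload_bits: int, line_bits: int) -> Fraction:
    """Fraction of transmitted bits that carry payload."""
    return Fraction(payload_bits, line_bits)

def overhead(payload_bits: int, line_bits: int) -> Fraction:
    """Extra line bits sent per payload bit."""
    return Fraction(line_bits - payload_bits, payload_bits)

if __name__ == "__main__":
    for name, (payload, line) in ENCODINGS.items():
        eff = float(efficiency(payload, line))
        ovh = float(overhead(payload, line))
        print(f"{name}: efficiency {eff:.1%}, overhead {ovh:.1%}")
```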
Figure 1 below shows the top 20 ports with Encoding Errors. The offending port, slot4 port6, is highlighted in the legend in bold.
In this case, we received an alert about the encoding errors as well as the CRC errors. While looking at the chart of the errors over time as shown in Figure 2, we noticed the same pattern.
Figure 2 shows the number of CRC errors. Encoding errors might lead to a CRC error; however, this metric shows frames that have been marked as invalid because of a CRC error earlier in the datapath.
According to the FC specifications, it is up to the implementer whether to discard the frame right away or to mark it as invalid and send it on to the destination anyway. There are pros and cons to both approaches.
Essentially, if you see CRC errors, it means the port has received a frame with an incorrect CRC, but the corruption occurred further upstream. If the Encoding Errors increase as well, there is a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or (in rare cases) replacing the SFP and/or HBA.
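The rules of thumb above can be sketched as a small triage function. This is a minimal illustration, not how any particular monitoring product works: the PortErrorDelta shape, the triage name, and the returned messages are hypothetical, and the inputs are assumed to be the increase in each counter over one sampling interval.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortErrorDelta:
    """Increase in error counters for one port over a sampling interval (hypothetical shape)."""
    port: str
    encoding_errors: int
    crc_errors: int

def triage(delta: PortErrorDelta) -> str:
    """Map the combination of rising counters to the likely cause described in the text."""
    enc_rising = delta.encoding_errors > 0
    crc_rising = delta.crc_errors > 0
    if enc_rising and crc_rising:
        # Both counters rising: physical link problem on this port.
        return "physical link: clean connectors, replace cable, then consider SFP/HBA"
    if enc_rising:
        # Encoding errors without CRC errors: most likely the cable alone.
        return "cable: inspect, reseat or replace the cable"
    if crc_rising:
        # CRC errors alone: the frame was corrupted earlier in the datapath.
        return "upstream: trace the frame's path back toward the source"
    return "healthy: counters are not increasing"

if __name__ == "__main__":
    # Counter values are made up for illustration.
    print(triage(PortErrorDelta(port="slot4 port6", encoding_errors=120, crc_errors=35)))
```

In practice a threshold above zero is usually needed to filter out occasional one-off errors; the right value depends on the fabric and is not something this sketch tries to choose.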
Fibre Channel design best practice stipulates that each host maintains at least two paths to the data in order to maintain redundancy at all times. If one path fails, the redundancy is gone and the host is vulnerable to losing its connection to the data.
In this case, we began receiving these errors along with a few others, and we alerted the customer to check the cable and SFP. From 10:00 AM on 11/11/2019 to 11:00 AM on 11/12/2019, while the host initiator port was throwing these errors, the host effectively had only a single reliable path to the fabric. Fortunately, the other path remained functional, but during this period the availability risk for this server was high.
On 11/12 the cable connection was inspected in the data center and was found to be loose. Someone may have been working on the fabric and bumped the cable. It was reconnected firmly at around 11:00 AM, after which the errors stopped and full datapath availability was restored.
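As a rough sketch of how the two-path rule could be checked automatically, the function below flags any host with fewer than two healthy paths. The input format (a host name mapped to a list of per-path health flags) and all names are assumptions made for this example; a real check would derive path health from the fabric and host multipath data.

```python
from typing import Dict, List

# From the best practice above: every host should keep at least two healthy paths.
MIN_HEALTHY_PATHS = 2

def hosts_at_risk(paths: Dict[str, List[bool]]) -> List[str]:
    """Return the hosts with fewer than MIN_HEALTHY_PATHS healthy paths, sorted by name."""
    return sorted(
        host
        for host, path_states in paths.items()
        if sum(path_states) < MIN_HEALTHY_PATHS  # True counts as 1, False as 0
    )

if __name__ == "__main__":
    # Hypothetical inventory: True means the path is healthy.
    inventory = {
        "hostA": [True, True],
        "hostB": [True, False],  # one path degraded, so redundancy is lost
    }
    print(hosts_at_risk(inventory))
```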
Avoid Risks in your Fabric Environment
In this blog we looked at some of the key indicators that your fabric is behaving well. It is essential that you monitor your fabric and alert when there are errors, as the errors are often a warning that something is going to fail soon or an alert that something has already failed.
It is important to understand the errors so you can treat them in accordance with their severity and impact. In order to effectively do this, you must collect the right data and set up the appropriate alerting mechanisms.
Are you monitoring your fabric for these types of issues? Do you have a way to filter out the false positives? Do you understand what all the errors signify, and which ones can be ignored? If you would like a review of the health of your SAN fabric, or would like to engage IntelliMagic for proactive monitoring of your fabric, please let us know how we can help by sending an email to email@example.com.