reactiveLabs

How to Tell if Your PERC H730 Is About to Grenade a VSAN Host

The PERC H730 has been supported by VSAN for many releases but has had a rocky history. There have been many firmware/driver revisions to fix bugs in its implementation of pass-through HBA mode, and some of these were even pulled by Dell/VMware after release because of critical bugs. The latest release, 25.5.5.005, seems to fix most of them. We have a fairly large estate of VSAN clusters using the H730 which we were in the process of updating, so many were still running older firmware/driver combos.

One nasty bug in the older revisions causes the controller to lock up, which leads VSAN to issue controller resets and eventually declare all disk groups on the host degraded. That forces a resync of all data that was on the host, exposing you to data loss if there is a further failure and you are running FTT=1 on any VMs. The fix is simple (reboot the affected host and wait for the resync), but I'd rather avoid the whole situation in the first place. After this had happened a few times, I did a deep dive into the ESXi logs and worked out an early warning system for this bug.

Symptoms:

  • Disks begin being reset periodically. You can check for these resets in the DRAC Lifecycle log or the ESXi log. Note that they do NOT show up in the regular DRAC system log, and the Lifecycle log describes them as normal events that are not a cause for concern.

ESXi log strings to look for:

  • Fast Path Status Updates
  • Online Disk Reset – ODR is by far the most reliable indicator
  • 0x10c
  • 0x10d
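
As a rough illustration of checking a log for the strings above, here is a minimal Python sketch. It assumes you already have the ESXi log available as plain text lines; the function name `count_indicators` is made up for this example, and the case-insensitive substring match is an assumption, since the exact wording and casing of the log messages may differ between driver versions.

```python
from collections import Counter

# Indicator substrings taken from the list above. Matching is
# case-insensitive because the exact casing in the log is an assumption.
INDICATORS = (
    "Fast Path Status Updates",
    "Online Disk Reset",
    "0x10c",
    "0x10d",
)


def count_indicators(lines):
    """Return a Counter of how many lines contain each indicator string."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for indicator in INDICATORS:
            if indicator.lower() in lowered:
                counts[indicator] += 1
    return counts
```

Passing an open file object works, since iterating a file yields its lines. Plain substring matching is deliberately loose and may overcount (for example, `0x10c` also matches a longer hex value that starts with it), so treat the counts as a prompt to look closer rather than a verdict.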

We set up alerts in Log Insight, but something like Graylog would also work well for this. The alerts trigger on ODR for a single host when the count exceeds 40 in one hour or 20 over 6 hours. You may need to adjust these thresholds for your hosts' particular number of disks/disk groups. With the warning system in place, you will usually get alerts several hours before a disk controller reset happens. This gives you plenty of time to evacuate the host of VMs, put it into maintenance mode, and reboot it. Note that this is probably not enough time for a full evacuation of VSAN data from the host, but the short risk window introduced by the reboot is much better than the long risk window caused by the controller reset.
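
If you are not using Log Insight, the same threshold logic is easy to reproduce elsewhere. The sketch below is not Log Insight query syntax, just the rule expressed in Python under some assumptions: you have already extracted the timestamps of ODR messages for one host, and the names `RULES` and `odr_alert` are made up for this example.

```python
from datetime import datetime, timedelta

# (window, threshold) pairs from the text: more than 40 ODR events in
# one hour, or more than 20 over six hours. Tune these per host.
RULES = (
    (timedelta(hours=1), 40),
    (timedelta(hours=6), 20),
)


def odr_alert(event_times, now, rules=RULES):
    """Return True if any trailing window holds more events than its threshold."""
    for window, threshold in rules:
        start = now - window
        # Count events that fall inside the trailing window (start, now].
        recent = sum(1 for t in event_times if start < t <= now)
        if recent > threshold:
            return True
    return False
```

For example, `odr_alert(timestamps, datetime.now())` could be run every few minutes per host, with the result feeding whatever paging or ticketing system you already use.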

I strongly recommend (and Dell/VMware concur) avoiding the H730 for VSAN deployments on Dell nodes if possible and using the HBA330 instead.