I have five M1000e chassis with a mixture of M610 blades and a handful of M620 blades.
Last year a couple of blades generated email alerts with a message similar to:
"Critical Server 16 health changed to a critical state from either a normal or warning state."
Checking the CMC, I found the blades were in an error state:
"The watchdog timer expired."
I also found the iDRAC log showing the same watchdog error.
Initially it was one blade, then a couple more, and over a week a couple dozen blades across multiple M1000e chassis were affected.
I'd rule out physical server problems because so many servers are experiencing this. I'd rule out chassis/CMC problems for the same reason.
The blades flash their warning LEDs, and the CMC does the same.
Otherwise the servers are still working without issue.
I am using fairly new firmware, and the blades worked for months without a problem before this started happening.
To "fix" this I run "racadm racreset" on the iDRAC, which clears the condition in the iDRAC but not the CMC. I then run "racadm racreset" on the CMC, and the issue clears up everywhere.
It works fine for a couple of weeks, then starts again.
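For anyone who wants to repeat the workaround across many blades, here is a rough sketch of how I would script it. It assumes remote racadm is installed on a management host and uses its "-r / -u / -p" remote options; the addresses, user name, and password below are placeholders, and the helper names (racreset_command, reset_plan, run_plan) are made up for illustration. It defaults to a dry run that only prints the commands.

```python
import subprocess


def racreset_command(host, user, password):
    """Build the remote racadm call that resets one controller (iDRAC or CMC)."""
    return ["racadm", "-r", host, "-u", user, "-p", password, "racreset"]


def reset_plan(idrac_hosts, cmc_host, user, password):
    """Reset the affected iDRACs first, then the CMC, in the same order as the manual fix."""
    plan = [racreset_command(h, user, password) for h in idrac_hosts]
    plan.append(racreset_command(cmc_host, user, password))
    return plan


def run_plan(plan, dry_run=True):
    """Print each command (password masked) or execute it when dry_run is False."""
    for cmd in plan:
        if dry_run:
            masked = cmd[:6] + ["****"] + cmd[7:]
            print(" ".join(masked))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder addresses and credentials; replace with real values.
    plan = reset_plan(["192.0.2.16", "192.0.2.17"], "192.0.2.2", "root", "changeme")
    run_plan(plan, dry_run=True)
```

This only automates the reset; it does nothing about whatever is triggering the watchdog in the first place.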
It only seems to be a problem with the M610, not the M620.
Some blades run ESXi 5.1 and others ESXi 5.5; the ESXi version seems to be a non-issue, since it happens on both.
Blades have 6.4.0 BIOS, and 1.6.0.73 firmware.
M1000e has mid-plane 1.1.
CMC has A03 hardware and 4.50 firmware.
I reset all blades and CMCs again last night, and within 12 hours one blade had already entered this state.
Any ideas?