The watchdog timer expired.

I have five M1000e chassis with a mixture of M610 blades and a handful of M620 blades.

Last year I had a couple blades generate email alerts with a message similar to:

"Critical Server 16 health changed to a critical state from either a normal or warning state."

Checking the CMC I found the blades were in error state:

"The watchdog timer expired."

I also found the iDRAC log showing the same watchdog error.

Initially it was one blade, then a couple more, and within a week a couple dozen blades across multiple M1000e chassis were affected.

I'd rule out physical server problems, since so many servers are experiencing this. I'd rule out chassis/CMC problems for the same reason.

The blades flash their warning LEDs, and the CMC does the same.

Otherwise the servers are still working without issue.

I am using fairly new firmware, and the blades worked for months without a problem before this started happening.

To "fix" this I run "racadm racreset" on the blade's iDRAC, which clears the condition in the iDRAC but not in the CMC. I then run "racadm racreset" on the CMC, and the issue clears up everywhere.
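For anyone who wants to script that workaround, here is a minimal sketch in Python. It assumes the iDRACs and the CMC accept SSH logins and will run "racadm racreset" as a remote command. The hostnames, the "root" user, and both function names are hypothetical placeholders, not anything from my environment. It only automates the reset order described above (iDRACs first, then the CMC) and does nothing about the underlying cause.

```python
import subprocess


def build_reset_commands(idrac_hosts, cmc_host, user="root"):
    """Return ssh command lines: every iDRAC first, then the CMC last.

    Hostnames and user are placeholders (assumptions); the only racadm
    command used is "racadm racreset", as in the workaround above.
    """
    targets = list(idrac_hosts) + [cmc_host]
    return [["ssh", f"{user}@{host}", "racadm", "racreset"] for host in targets]


def run_resets(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first reset that fails
            subprocess.run(cmd, check=True)
```

Calling run_resets(build_reset_commands(["idrac-a.example"], "cmc-a.example")) only prints the commands; pass dry_run=False to actually send the resets.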

It works fine for a couple of weeks, then starts again.

It only seems to be a problem with the M610, not the M620.

Some blades run ESXi 5.1 and others run ESXi 5.5. The ESXi version seems to be a non-issue; it happens on both versions.

Blades have 6.4.0 BIOS, and 1.6.0.73 firmware.

M1000e has mid-plane 1.1

CMC has A03 hardware and 4.50 firmware.

I reset all blades and CMCs again last night, and within 12 hours had one blade already enter this state.

Any ideas??
