We were recently helping out a company in the same building as ours with a server issue they were having. They noticed it had rebooted out of nowhere a couple times within a week. None of the event logs showed anything, no crash dump file, pretty much no trace software wise. Luckily EventSentry sent us a 6009 event from the system log, letting us know that the server had rebooted. Knowing when this event occurs is great, especially on nights or weekends when users may not notice.
I was 99% sure it was a hardware issue since it was out of the blue with no recent hardware or software changes. We ran some basic diagnostics, including the ones from Dell. Everything kept coming back clean. After contacting Dell, they recommended re-seating all the RAM, the CPU’s, VRM’s etc. They have had problems in the past with CPU’s coming out of the sockets from the heatsink compound drying up and causing the same issue. I should have instantly noticed the problem then, but we will get to that.
None of this was helping and the reboots were becoming more and more frequent. The server was not under warranty so Dell couldn’t help out much more than that. I was actually amazed how helpful they were at all since it wasn’t. Their last suggestion was to start disabling hardware until we got to the root of the problem.
I went into device manager and disabled anything I could. The system became much more stable, although useless without the devices. I started enabling hardware again one at a time. After I enabled the built in NIC, the computer crashed. We threw in a PCI network card, and disabled the onboard NIC in the BIOS. The server booted up and all was great. For about 3 days…
The crashes started again, this time Windows couldn’t even finish loading before it would reboot. We opened the server again and this time I instantly saw what was wrong. I had seen this in a workstation before so I couldn’t believe I missed it. Almost all the capacitors on the board were bulging at the top.
This has become so common lately, I highly recommend looking for that right away on any critical server you have. There were even a few motherboard makers sued over this.
Some makers, like Gigabyte, are using solid state capacitors instead of the cheaper, more common electrolytic ones for some of their boards. I’m sure it costs them a little more, but for reliability I think it is completely worth it.
We ordered a new motherboard for the server, and sure enough it had a completely different brand of capacitors. Once we swapped it out and booted it up, the server has been running smooth. An extra $5 for some quality capacitors would have probably prevented the whole situation.
Here are some pictures of what to look for:
Taken from http://img.photobucket.com/albums/v711/whurd/Bad.jpg
The tops should be completely flat. If there is any bulging at all, it is most likely on its way out. The picture below shows leaking capacitors, also not a good thing.
Taken from http://macmedics.com/images/imac-logicaboard-with-leaking-capacitors.jpg
An interesting and related article has since been published on Wikipedia: Capacitor Plague.