Saturday, December 19, 2009

Hunting the elusive great white system freeze

***********************************************************
***********************************************************
                                        31JAN2010 Update

My computer ended up not even being able to get past the BIOS POST screen. I ended up RMAing my Asus M4A79T Deluxe. While I was waiting for my Asus board to come back I bought a "Gigabyte GA-790FXTA-UD5" to use instead. The GA-790FXTA-UD5 has worked flawlessly since I installed it about three weeks ago. When my Asus board gets back I plan on eBaying it right away.

Moral of the story? When the going gets tough, quit. The sooner the better. I wish I had just RMAed the Asus board right away. On the plus side I do love my new Gigabyte board.

***********************************************************
***********************************************************
                                     Original Post

For a shorter version of this blog entry see my Asus forum post here.

This concise tale of woe and misery describes some of the suffering I endured to isolate an intermittent system freeze issue. The primary purpose of this post is to generate sympathy for me (and to get my bi-weekly blog entry online in time). Perhaps some will be so moved by my tragic story that they'll send monetary gifts to help ease my pain. Any gifts in excess of $100 will be greatly appreciated; don't get cheap on me now.

The troubleshooting steps described in this post may also give some ideas to others who are suffering from similar computer instability issues. If my suffering can help even one person fix their computer more quickly, then all of my suffering will not have been worth it. Even if I help a million people with their computers, it still won't have been worth it. Slaying this instability beast was a horrible, painful time drain. This is the kind of torture I would wish on my worst enemies. Only many large gifts of money could even begin to help ease my pain.

It all started in July 2009. My wife's complaining about her system instability seemed to increase with the frequency of her lock ups. I tried to comfort her by repeatedly telling her that I'd be happy to install Fedora Linux on her computer to make all of her problems go away. Although she acknowledges that running an OS that supports about 1% of her apps should greatly improve her computer stability, she illogically (and repeatedly) turns down my generous offer of a fresh install of Linux. Perhaps if I had offered her Ubuntu instead? Alas it's too late for that now.

Our household computer standard is to refresh our desktop machines every 10 years. We're only about a year away from our desktops' tenth birthday, so I decided to bend the rules a bit and go ahead and build her a new PC now. I figured if I bought quality components for a change that building a new PC should be faster than troubleshooting her ancient desktop. I'm not sure that a simple re-install would've fixed her old PC, some hardware may have been going bad too.

In all my years of building computers with budget parts I've never had a major issue. Ironically the one time I shell out some extra green for "quality" parts I suffer the worst hardware problems I've ever had! My computer build part list is listed below. My two primary goals for this PC were:
1. Rock solid stability (ha ha ha ha ha)
2. Quiet (and low power)


[Parts list]
-Antec Sonata Designer 500 case
-Noctua NF-S12B ULN 120mm case fan
-Corsair VX450W power supply
-Asus M4A79T Deluxe with BIOS 2304 (I also tried BIOSes 2002-2205, all had the same symptoms)
-AMD Phenom II X4 905e Deneb 2.5GHz 4 x 512KB L2 Cache 6MB L3 Cache Socket AM3 65W Quad-Core Processor
-CT2KIT25672BA1339 - 4GB kit (2GBx2), 240-pin DIMM , DDR3 PC3-10600 from Crucial.com
http://www.crucial.com/store/partspecs.aspx?IMODULE=CT2KIT25672BA1339
* Module Size: 4GB kit (2GBx2)
* Package: 240-pin DIMM
* Feature: DDR3 1333 (PC3 10600)
* Specs: DDR3 PC3-10600 CL=9 Unbuffered ECC DDR3-1333 1.5V 256Meg x 72
-Scythe Mugen 2 CPU cooler
-SAPPHIRE 100252HDMI Radeon HD 4550 512MB 64-bit GDDR3 PCI Express 2.0 x16
-Western Digital Caviar Green WD10EADS 1TB (I also tried a "SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb", but it didn't change anything)
-SAMSUNG S223L DVD±RW 22x SATA WHITE
-FA-8V08-WH 18-in-1 Floppy + Internal Flash Card Reader


I'm very pleased with the sound level of this PC. Very quiet. If I was to do it again I probably wouldn't worry about getting a special "quiet case". Just eliminating as many fans as possible and reducing the noise of the fans you do have to use seems to get you about 99% of the way to quiet PC goodness.

As you've probably guessed I'm not so pleased with my system stability. It all started out so well. As part of my computer build process I run memtest86+ (available on "System Rescue CD"), prime95 and OCCT. These all came back fine. Vista installed fine, everything installs fine. No problems whatsoever.

The first sign of something amiss was when I was taking an image of my C drive with the most excellent fsarchiver (ran from my favorite "System Rescue CD"). Note that fsarchiver is really a file system archiver, not a partition imager, but that just means it does a better job of creating "partition images" than a true partition imager does.

Anyway, System Rescue CD froze while running fsarchiver. This would be quite embarrassing if Heather ever found out. Fortunately the only place where I've revealed this faux pas is in this blog entry which is far too long and boring for her to read. I chalked this freeze up to System Rescue CD using too new of a kernel. Perhaps in the latest kernel Linus replaced the SCO code he stole with some M$ code that supports system freezing?

So after rebooting and finishing taking the image of her C drive I give Heather her shiny new computer (complete with a "champagne pink" bezel). She seemed quite happy with it at first, but then the unthinkable happened, Vista froze! After the first freeze I was disappointed, but not worried. I just assumed that Vista had some sort of backwards compatibility with WinXP freezes and enduring a system lockup every month or two was a small price to pay to be a Bill Gates groupie.

Unfortunately these freezes weren't that infrequent. They'd happen once or twice a week. Infrequent enough to make troubleshooting and isolating the problem a nightmare, yet frequent enough to be fairly annoying. It also really rubbed me the wrong way to have this system that I paid a premium for (to get the best stability) locking up twice a week. She also reported that these freezes always seemed to happen when she clicked on something (start menu, starting a program etc.). This bit of info may prove useful....

Note that when I say freeze I mean that my computer just stops, the screen freezes with whatever's on the screen. Keyboard & mouse are unresponsive. Can't ping it over the network. I.e. it behaves just like a BSOD, except that instead of a blue screen it's just your regular Windows screen that's "frozen".

Something had to be done. Someone brave enough to face the dreaded "random & infrequent system freeze of death" had to be found! But what kind of man could have the infinite amount of courage and patience required to not only face this fearsome beast, but actually have a slim chance of victory? It hit me one morning after staring in the mirror at my ruggedly handsome face for half an hour that only I had the jeanious, good lucks, valor, courage and wit to face down this dastardly foe.

After saying goodbye to my family I began my journey into the heart of darkness (seems like there should be a witty Ubuntu reference here, but there isn't, get over it). My first goal was to be able to reproduce this problem at will. If the system only freezes once a week it could take years to isolate the cause. Fortunately I had actually used fsarchiver several times to take images of the C drive at different points (base install, with minimal apps, with full apps, etc.). I had several freezes while using fsarchiver, so I started out using fsarchiver as my test program since the normal stress test programs (like prime95, OCCT, memtest86+) could not reproduce the issue.

I found that by running fsarchiver in a continuous loop (see below) I could get the system to freeze in a matter of hours. The commands I used to image my ntfs C drive to a file on my ntfs D drive (D:\1 directory) in an infinite loop looked like this:
ntfs-3g /dev/sda2 /mnt/backup
cd /mnt/backup/1
while [ 1 ]; do date; rm -f cdrive.fsa; fsarchiver -j 4 -z 8 savefs cdrive.fsa /dev/sda1; sleep 60; done
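
A rough sketch, not part of the original testing: since a frozen machine can't tell you when it died, a heartbeat file written on every pass of the loop makes it easier to put a number on "a matter of hours". The function names and the log file are made up for illustration; the sketch assumes a shell with "date +%s" (GNU and BSD both have it).

```shell
# Append the current Unix time to a log and flush it to disk, so the
# last line has a chance of surviving a hard freeze.
# The log path passed in is a hypothetical example, not a standard file.
heartbeat() {
    date +%s >> "$1"
    sync
}

# Print the number of seconds between the first and last heartbeat in a log.
elapsed_seconds() {
    awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR > 0) print last - first }' "$1"
}
```

Calling "heartbeat /mnt/backup/1/heartbeat.log" in place of "date" inside the loop, and then running "elapsed_seconds /mnt/backup/1/heartbeat.log" after the next boot, gives the run time up to the last completed pass. The resolution is one loop iteration, so it is only a rough figure.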

Unrelated to fsarchiver I also noticed some corrected hardware error WHEA MCE (Machine Check Exception) logs within Vista (a few a week). These are found in the system log, source = "Microsoft-Windows-WHEA-Logger", Event ID = 19. Using AMD's mcat utility I was able to verify that these are reporting corrected ECC errors. They always reported a problem with memory bank 4. Maybe the RAM in slot 4 was bad? I swapped the two RAM modules, but the freezes continued and the MCE logs continued to point to bank 4. At this point I removed the RAM from slot 3 (leaving a single 2GB stick in slot 4) and the freezes and logs continued.

I've now proved that RAM bank 4 was part of the problem, but was it the whole problem? Just for the fun of it I reset my BIOS settings to defaults and tested again with fsarchiver. I then went 24 hours without error! I added all of my BIOS settings back in and the freezes returned. So obviously the problem is a combination of RAM slot 4 and a BIOS setting(s).

At this point I could've started a binary search of my BIOS settings to see which one(s) were part of the problem. However, during my fsarchiver testing I found that (although I usually got a freeze within a few hours) I could go up to a dozen or so hours before getting a freeze. I.e. it could take quite a while to finish isolating this problem with my current fsarchiver freeze re-creation method. I decided to try to find a better way to (more quickly) re-produce the system freezes.
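
For the curious, here is roughly what that binary search would look like. This is a minimal sketch with made-up helper names, and it assumes a single culprit setting; if the freeze needs a combination of settings, plain halving can miss it.

```shell
# Print how many halving rounds are needed to isolate one culprit
# among N suspects (the smallest r with 2^r >= N).
rounds_needed() (
    n=$1; r=0; size=1
    while [ "$size" -lt "$n" ]; do
        size=$((size * 2))
        r=$((r + 1))
    done
    echo "$r"
)

# Print the first half of the arguments (rounded up), one per line:
# the settings to re-apply for the next test run.
first_half() (
    half=$(( ($# + 1) / 2 ))
    i=0
    for s in "$@"; do
        if [ "$i" -lt "$half" ]; then printf '%s\n' "$s"; fi
        i=$((i + 1))
    done
)
```

With the 14 non-default BIOS settings listed later in this post, "rounds_needed 14" prints 4. At up to a dozen hours per run that is about two days of testing in the best case, which is why a faster way to trigger the freeze was worth finding first.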

I was running fsarchiver with pretty intensive compression (the "-z 8" option). This caused fsarchiver to max out my CPU most (but not all) of the time. When fsarchiver is waiting for more data from the disk it can't be compressing, so the CPU usage drops way down. I believe it actually will stop and go into halt state (very briefly) while waiting for more data from the hard drive to compress.

If maxing out the CPU with prime95 and OCCT doesn't cause a problem, but yet running fsarchiver with its varying levels of CPU usage does freeze the computer, maybe it's the act of bringing the CPU out of the halt state that contributes to this problem? This would also explain why Heather notices the freezes when she clicks on (i.e. starts) something.

I decided to try cycling prime95 by continuously starting prime95, stopping prime95 (and sleeping for a second) and then starting prime95 again. Success! With this prime95 cycle I cut my average time to freeze down from 3 hours to 30 minutes! The simple batch file I use to do this is listed in the [prime95cycle.cmd] section below.

prime95cycle.cmd is the batch file I used to cycle prime95 continuously. This works for me with Vista Business 32 bit. I believe it will work under all versions of Windows 7 as well. I know that WinXP doesn't have the "taskkill" or "timeout" commands, but it's easy enough to Google for WinXP replacements to those.


[prime95cycle.cmd]
REM Give Windows a minute to finish starting up
timeout 60 > nul

:loop
REM start prime95 in torture test mode (-t).
start "" D:\1\prime95.exe -t

REM Sleep while prime95 runs
timeout 1 > nul

REM Gracefully kill the prime95 torture test we started above
taskkill /im prime95.exe

REM Sleep for 2 seconds to allow prime95 time to close. Without this pause the next prime95 won't start since you can't have two instances of prime95 running at once.
timeout 2 > nul

REM rinse, repeat
goto loop



If you're running Linux prime95 is called "mprime" instead, so you'll need to run some commands like this:
while [ 1 ]; do mprime -t >/dev/null & sleep 1; pkill mprime; sleep 1; done
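
A slightly more careful variant of the same idea, as a sketch only: it stops after a fixed number of cycles, kills the exact process it started (by PID rather than by name with pkill), and prints a timestamp per cycle. The function name and its arguments are made up for illustration.

```shell
# cycle_load ITERATIONS RUN_SECONDS REST_SECONDS COMMAND [ARGS...]
# Start COMMAND in the background, let it run, stop it, rest, repeat.
cycle_load() (
    iterations=$1; run_s=$2; rest_s=$3
    shift 3
    i=1
    while [ "$i" -le "$iterations" ]; do
        "$@" >/dev/null 2>&1 &
        pid=$!
        sleep "$run_s"
        kill "$pid" 2>/dev/null || true   # the command may already have exited
        wait "$pid" 2>/dev/null || true   # reap it before starting the next one
        sleep "$rest_s"
        echo "cycle $i finished at $(date)"
        i=$((i + 1))
    done
)
```

For example, "cycle_load 1000 1 1 mprime -t" runs a thousand one-second bursts of the torture test with a one-second rest in between.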

Maybe it's just superstition, but I believe performing frequent reboots also makes the problem more likely to occur. It seems that the computer is more likely to freeze after a reboot or cold start. Therefore I created another simple script (see [reboot15min.cmd] below) that reboots my computer after 15 minutes. Even if rebooting isn't really necessary, this has the added benefit of clearing out any hung prime95 processes that taskkill couldn't get rid of (which happens occasionally).


[reboot15min.cmd]
REM 900=15 minutes
timeout 900 /nobreak > nul

REM Reboot
shutdown -r -c "Stress test reboot" -d p:0:0



Now all I need to do is put prime95cycle.cmd and reboot15min.cmd in my startup folder, reboot and then wait for the computer to freeze. Like I said before I averaged about 30 minutes until freezing using this method. However, for reasons I'm sure I'll never understand, my computer could still occasionally go hours without a freeze. Therefore while I was isolating which BIOS setting(s) were part of this problem I'd wait for 12 hours of stress testing to complete before assuming whatever set of BIOS settings I was using were safe.
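
A back-of-the-envelope check on that 12 hour cutoff, as a sketch under a big assumption: if freezes arrived at a constant random rate (an exponential distribution, which the occasional multi-hour run suggests is too tidy), the chance of a still-broken setup surviving t minutes with a mean time to freeze of m minutes would be exp(-t/m). The function name is made up.

```shell
# survival_chance MEAN_MINUTES WAIT_MINUTES
# Chance that a still-broken setup survives WAIT_MINUTES without freezing,
# ASSUMING freezes follow an exponential distribution (a simplification).
survival_chance() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.4f\n", exp(-t / m) }'
}
```

Under that assumption "survival_chance 30 720" prints 0.0000, so a 12 hour clean run is very strong evidence with the prime95 cycle. With the 3 hour fsarchiver average, "survival_chance 180 720" prints 0.0183, i.e. roughly a 2% chance of being fooled.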

The BIOS settings I changed from default are:
-Main->Storage Configuration = Set all SATA ports to "AHCI"
-Ai Tweaker->CPU Spread Spectrum = Disabled //Default is "Auto"
-Ai Tweaker->PCIE Spread Spectrum = Disabled //Default is "Auto"
-Advanced->CPU Configuration->Cool'n'Quiet = Enabled
-Advanced->CPU Configuration->C1E = Enabled
-Advanced->Chipset->Northbridge Configuration->ECC Configuration->ECC Mode = Good
-Advanced->USB Configuration->Legacy USB Support = Auto //Default is "Enabled"
-Power->ACPI 2.0 support = Enabled
-Power->APM Configuration->Power On By PME = Enabled //WOL
-Power->Hardware Monitor->CPU Q-Fan Function = Enabled
-Power->Hardware Monitor->Select Fan Type = PWM Fan
-Power->Hardware Monitor->-CPU Q-Fan Mode = Silent
-Boot->Boot Settings Configuration->Full Screen Logo = Disabled
-Tools->Express Gate = Disabled



After testing various combinations of BIOS settings I noticed that when I have ECC enabled in the BIOS I get freezes (not BSODs) and the occasional WHEA MCE log telling me about a corrected memory error in RAM bank 4. If ECC is disabled I just get a BSOD instead. Therefore if you're like 99.99999% of home computer users who don't use ECC RAM you'll need to ensure that your computer doesn't automatically reboot after a BSOD by following the steps here: http://www.internetfixes.com/vista_tips/IF01186.htm

After all this work I can finally state what my problem is. With C1E enabled and RAM in slot 4 (the black slot farthest away from the CPU) I get intermittent freezes and/or BSODs. The problem seems to be with the CPU coming out of the C1E state and delivering power to RAM slot 4. Enabling Q-Fan and frequently rebooting the computer seem to help the problem occur more quickly/often, but they are not required to re-produce this issue.

I think the problem is with the CPU coming out of C1E state for two reasons:
1. The most obvious reason is that the problem only occurs if C1E is enabled. The only default BIOS setting you need to change to get this problem is to enable C1E (however, enabling Q-Fan does help the problem to occur more often, more on this later).
2. I can run stress tests (like OCCT and prime95) for days without error. This proves my system is stable under load and the problem is not heat related. However, if I continuously start prime95, stop (and sleep for a second) and then start prime95 again I can usually get the problem to occur within an hour (with Q-Fan enabled, more on that below).

I believe this issue is power related because enabling Q-Fan seems to help reproduce the problem quicker. In my testing with a standard 60mm size CPU fan I can get the problem to happen most quickly with Q-Fan enabled and set to "PWM Performance", "DC Silent" or "DC Optimal". Using a larger 120mm fan (which runs at a much lower RPM) instead it seems that "PWM Silent" or "PWM Optimal" are the settings most likely to reproduce this problem. "DC Silent" also works, but since my 120mm fan never actually spins with this setting I don't like to use it.

Note, however, that enabling Q-Fan is NOT required to have this problem, it just makes the problem much more likely to happen. I've even gotten my computer to freeze with no CPU fan connected at all (yes, my temps were fine, I have a huge heatsink and don't even really need a fan). My best attempt at an explanation for this is that whenever the system is using "some magic amount of power" and the computer comes out of C1E it can't deliver the right amount of power to RAM slot 4.

I know the problem is with RAM slot 4 because:
1. All of my WHEA MCE (Machine Check Exception) logs always indicate a problem with slot 4
2. When I try different slots and sticks of RAM the problem only occurs if I have a stick of RAM in slot 4.


[Other things I tried that didn't help]
Changing RAM voltage didn't help. My Crucial RAM spec says that it should use 1.50 volts. With the BIOS default setting of "AUTO" for the RAM voltage, the voltage actually gets set to 1.60 volts. I've tried setting a range of RAM voltages from 1.50-1.66 volts (in 0.02 volt increments), but it didn't help. I don't want to go higher than 1.66 volts for fear of frying my RAM.

Changing the OS didn't help. I got the same freezes when using Fedora 12 Linux. This obviously exonerates all of my Windows drivers, programs, etc. as being part of the problem. I should also note that the vast majority of my testing was done with a base install of Vista Business 32 bit. Pretty much just did a default install and ran Windows Update.


[To keep this post from being too short here are some free bonus bug hunting tips]
1. In Linux the MCE logs are in /var/log/mcelog
2. Fedora 12 (64 bit) writes the MCE logs to disk once an hour in this cron job "/etc/cron.hourly/mcelog.cron". Since my system would usually freeze before the MCE logs got written to disk I deleted the hourly cron job and instead added this job to "/etc/crontab" so that the MCE logs would get written out every minute:
* * * * * root /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
3. In Windows you can use the "AMD MCAT Machine Check Analysis Tool" to get more details on the MCE logs.
http://support.amd.com/us/Pages/dynamicDetails.aspx?ListID=c5cd2c08-1432-4756-aafa-4d9dc646342f&ItemID=178
-Machine Check Analysis Tool (MCAT) is a command line utility that takes a Windows System Event Log (.evt) file as an argument and decodes the MCA Error logs into human readable format. MCAT can alternatively take as an argument the raw register hexadecimal values from an MCE Error.
-When I used this on Vista it couldn't read the event log directly. However I could manually give it the numbers from the event log for it to decode. I used commands like this:
mcat /cmd 4 0xf4782000e0080a13 0x90b2c700 0x2231eaa00000000
mcat 4 0x94744000eb080a13 0x30a4bc0 0x226dbd200000000
mcat 4 0x94214000b5080813 0x15ba7c00 0x225385200000000
4. In Vista it's helpful to create a custom log view that shows serious errors. Then by using this custom log view you can quickly see if you had any serious errors logged. My custom view is set up like this:
-Logged: Any time
-Event logs: System
-Event Sources: BugCheck, eventlog, Eventlog, WHEA-Logger
-Includes Event IDs: 19,1001,6008
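
Going back to the Linux MCE log from tips 1 and 2, a quick tally per bank shows whether the errors really do keep pointing at the same place. This is a sketch with a made-up function name, and it ASSUMES each record contains the word BANK followed by the bank number; if your mcelog version prints records differently, adjust the match.

```shell
# count_by_bank LOGFILE
# Tally how many times each bank number appears after the word "BANK".
# Assumes records look something like "CPU 0 BANK 4 ..." (an assumption
# about the log format; check your own /var/log/mcelog first).
count_by_bank() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "BANK") n[$(i + 1)]++ }
         END { for (b in n) print "bank " b ": " n[b] }' "$1" | sort
}
```

Running "count_by_bank /var/log/mcelog" prints one line per bank with the number of records seen for it.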



In the end, the problem is probably with my "Asus M4A79T Deluxe" motherboard. Since disabling C1E fixes the issue, I'm hopeful that a simple BIOS update can fix this (I've got a ticket open with ASUS on this issue). I know I could avoid this issue simply by using the orange RAM slots and/or disabling C1E. However, I paid good money for this board, so everything should work dadgummit! C1E saves me ~ 7 watts of power and who knows when I'll want to run 4 sticks of RAM.

Other than this issue, I really like this motherboard. It's got good support for ECC RAM. Plenty of old school connections like PS2 keyboard and mouse, serial port (need to buy the cable separately), PCI, IDE and floppy connections. All this in addition to modern conveniences like a very customizable BIOS, easy BIOS updates, SATA, eSATA, 4 PCIe slots, lots of USB2.0 ports, firewire, etc.

2 comments:

Daniel Johnson said...

It will be interesting to see if Asus has a clue at all. I think you should demand compensation for the extensive trouble-shooting; their own engineers could not have done a better job. If that fails, AND the donations fail, well, there's always Google AdSense for all the people desperately seeking ECC Ram slot 4 fixes.

What's the possibility you just have a bad MB?

Jim said...

My computer ended up not even being able to get past the BIOS POST screen. I ended up RMAing the board. While I was waiting for my Asus board to come back I bought a "Gigabyte GA-790FXTA-UD5" to use instead. This one has worked flawlessly since I installed it about three weeks ago. When my Asus board gets back I plan on eBaying it right away.

Moral of the story? When the going gets tough, quit. The sooner the better. I wish I had just RMAed the Asus board right away. On the plus side I do love my new Gigabyte board.