Why create this post? Because I couldn't find a good website that showed how ECC should be configured for an above-average home ECC user like me. I've found the answers to some of my questions and have taken educated guesses at what the other settings should be. I hope this post can either help someone else out, or at least get someone to point out where I can find an easy-to-understand, authoritative source on these settings.
Basically ECC RAM is able to correct a single bit memory error on the fly (and report on double bit memory errors). In theory this gives you better system uptime (fewer BSODs). The downsides to ECC RAM are that you take a 0.5-2% performance hit (depending on the type of app you're running) and that it costs more.
According to this article the more ECC features you enable the more of a performance hit you take. So if you're a gamer you might want to enable just the basic ECC checking. Whether it costs a little or a lot more depends on timing and how savvy of a shopper you are.
More details on what ECC RAM is and the theory behind it are easy to find via Google. One interesting site is this summary of Google's 2.5-year study of DRAM error rates.
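To make "corrects single bit errors, reports double bit errors" a bit more concrete, here is a toy sketch in Python. It uses a tiny extended Hamming code over 4 data bits, which is a simplification for illustration only: real ECC DIMMs keep 8 check bits for every 64 data bits, and the function names encode and decode are made up for this example, not taken from any ECC documentation.

```python
# Toy SECDED (single error correct, double error detect) code.
# Assumption: 4 data bits + 4 check bits, purely to show the principle.

def encode(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]   # positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]    # position 8: parity over the whole word


def decode(codeword):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(codeword)
    # The syndrome is the XOR of the (1-based) positions of all set bits
    # among positions 1..7; it is 0 for an undamaged codeword.
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos
    overall = 0
    for bit in w:
        overall ^= bit

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat it as a single bit error and fix it.
        if syndrome:
            w[syndrome - 1] ^= 1
        else:
            w[7] ^= 1          # the overall parity bit itself was hit
        status = "corrected"
    else:
        # Nonzero syndrome but even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [w[2], w[4], w[5], w[6]]


if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    one_flip = list(stored)
    one_flip[4] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode(stored)[0], decode(one_flip)[0], decode(two_flips)[0])
```

One flipped bit is repaired silently; two flipped bits in the same codeword can only be reported, which is the case scrubbing tries to prevent.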
What do you need to run ECC RAM? Well for starters you must have ECC RAM. Most RAM is not ECC RAM. ECC RAM is special. In my "Gigabyte GA-790FXTA-UD5" motherboard I use this RAM:
CT2KIT25672BA1339 - 4GB kit (2GBx2), 240-pin DIMM , DDR3 PC3-10600 from Crucial.com
* Module Size: 4GB kit (2GBx2)
* Package: 240-pin DIMM
* Feature: DDR3 1333 (PC3 10600)
* Specs: DDR3 PC3-10600 • CL=9 • Unbuffered • ECC • DDR3-1333 • 1.5V • 256Meg x 72
You also need a motherboard that supports ECC RAM. AFAIK all (or maybe all non-laptop) AMD CPUs support ECC RAM (the ECC RAM controller is actually in the CPU), so in principle any AMD motherboard could support ECC RAM, but the board and its BIOS still have to expose it. Some motherboard manuals will show you all the ECC settings. However my GA-790FXTA-UD5 manual didn't even mention ECC RAM. Its website did say that ECC was supported, so I took a gamble and bought this motherboard (which paid off). My CPU is an "AMD Phenom II X4 905e Deneb 2.5GHz 4 x 512KB L2 Cache 6MB L3 Cache Socket AM3 65W Quad-Core Processor".
So now that you've got an AMD CPU, AMD motherboard and ECC RAM what do you need to do next? Just slap it all together and turn your computer on and everything should work. However it's not quite that simple. Every motherboard I've seen has the ECC function disabled by default.
All (AMD) motherboard ECC settings should be similar. My ECC settings are found under "MB Intelligent Tweaker (M.I.T.)->DRAM Configuration". My (mainly plagiarized) definitions of these settings are:
- DRAM ECC enable = Turn on ECC
- DRAM MCE enable = Generate Machine Check Exception logs
- Chip-Kill mode enable = Can correct some multibit errors. Aka "4-Bit ECC"
- DRAM ECC Redirection = When a single bit ECC error is found write the corrected data back to RAM (i.e. scrub just the location with the ECC error)
- DRAM background scrubber = Interval between main memory scrubs (64 bytes)
- L2 cache background scrubber = Interval between main L2 cache scrubs (one single L2 cache line tag address)
- DCache background scrubber = Interval between data cache scrubs (64 bits)
My settings, and the reasoning behind each:
- DRAM ECC enable: Enabled
  - Why have ECC RAM if you're not going to use it?
- DRAM MCE enable: Enabled
  - MCE logs let you keep track of how many ECC errors you have (i.e. how many times ECC silently corrected a single bit error and saved you from a crash)
- Chip-Kill mode enable: Enabled
  - Correcting multi-bit errors sounds good to me.
- DRAM ECC Redirection: Enabled
  - See my section below about scrubbing
- DRAM background scrubber: 10.49ms (scrub 4GB in ~187 hours)
  - See my section below about scrubbing
- L2 cache background scrubber: Disabled
- DCache background scrubber: Disabled
  - I don't scrub either of my caches since this PDF says it's not worth it.
Besides "DRAM ECC Redirection", the other way to get ECC errors sitting in RAM corrected is to turn on scrubbing. Scrubbing is basically a background process (normal ECC checks only the RAM values that your CPU has requested) that goes through all of your RAM, reading small 64 byte sections and checking them for ECC errors. If it finds an error, it corrects the error and writes the corrected value back to RAM.
The main reason to scrub your RAM is to avoid getting two bit errors. For example, say a single cosmic ray flips one of your bits. If you leave your computer on 24x7 and you don't have "DRAM ECC Redirection" enabled, you will eventually have another cosmic ray flip another bit in the same "ECC area" as your first bit flip. ECC can't fix a 2 bit error, only report it (and freeze your computer). In summary, scrubbing greatly decreases the odds that you'll get a two bit ECC error.
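A back-of-the-envelope way to see why the time between scrubs matters is sketched below. Everything numeric here is an assumption for illustration: the per-word flip rate is a made-up placeholder (not a measured value), flips are treated as independent random events, and the function name is hypothetical. The only point is the trend: under these assumptions the expected number of double hits grows in proportion to how long a first flip is left uncorrected.

```python
HOURS_PER_YEAR = 24 * 365


def expected_double_hits_per_year(words, flips_per_word_hour, hours_between_scrubs):
    """Rough expected number of ECC words per year that collect 2+ flips
    before a scrub pass repairs the first one.

    Uses the small-x Poisson approximation P(2 or more events) ~= x**2 / 2,
    which only holds when x = rate * window is much smaller than 1.
    """
    x = flips_per_word_hour * hours_between_scrubs
    per_window = words * x * x / 2
    windows_per_year = HOURS_PER_YEAR / hours_between_scrubs
    return per_window * windows_per_year


if __name__ == "__main__":
    words = 4 * 2**30 // 8       # 64-bit data words in 4 GiB of RAM
    rate = 1e-12                 # HYPOTHETICAL flips per word per hour
    for window in (24, 187, HOURS_PER_YEAR):
        hits = expected_double_hits_per_year(words, rate, window)
        print(f"scrub every {window:>5} h -> {hits:.3e} double hits/year")
```

With any plausible rate the absolute numbers are tiny, which fits the observation that two bit errors are rare; the ratio between the rows is what the sketch is meant to show.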
Note that with Windows Vista and Windows 7, by default your computer sleeps when you shut it down. In sleep mode your RAM is still powered and therefore still capable of taking errors. I.e. as far as your RAM is concerned, if you use sleep mode your computer is on 24x7 and you should be scrubbing your RAM. If you truly power off your computer or reboot it regularly (like Windows Update is known to do) you probably don't need to scrub your RAM.
I configure my system to scrub 64 bytes (the amount of RAM scrubbed is not configurable) of RAM every 10.49ms. This will scrub my entire 4GB of RAM in ~187 hours. That may seem like a long time, but you need to remember that ECC errors are pretty rare anyway. Since you really only need to be scared of two bit errors, the odds of getting any two bit errors are pretty slim. Additionally, since I have "DRAM ECC Redirection" enabled, I'm in effect running another scrubber.
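The full-pass time is just (RAM size / 64 bytes) x scrub interval. A small Python sketch of that arithmetic follows; the function name is made up for this example. Depending on whether "4GB" is taken as 4 x 2^30 or 4 x 10^9 bytes, the result lands somewhere between roughly 180 and 200 hours, the same ballpark as the ~187 hours quoted above.

```python
def full_scrub_hours(ram_bytes, interval_ms, block_bytes=64):
    """Hours needed for the background scrubber to touch every block once."""
    blocks = ram_bytes / block_bytes
    return blocks * interval_ms / 1000 / 3600


if __name__ == "__main__":
    sizes = (
        ("4 GiB (4 * 2**30 bytes)", 4 * 2**30),
        ("4 GB  (4 * 10**9 bytes)", 4 * 10**9),
    )
    for label, size in sizes:
        print(f"{label}: {full_scrub_hours(size, 10.49):.1f} hours")
```

The same function can be used to see what a faster scrub interval from the BIOS list would cost in passes per week, or how the time scales if you add more RAM.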
Another random source I found useful is the "AMD Hammer Family Processor BIOS and Kernel Developer's Guide." It's not the easiest read, but there's some good info in there.
Now you know how/why I configure my ECC RAM to work for me. I'm not an authoritative source on ECC RAM. My settings may not be the best for you. They may not even be the best for me! With that dire warning aside, there is one more thing I should mention.
How do you monitor your ECC RAM? How do you know that it's actually working and correcting errors? Via the magic of MCE (Machine Check Exception) logs of course!
In Linux the MCE logs are in "/var/log/mcelog". Fedora 12 (64 bit) writes the MCE logs to disk once an hour via the cron job "/etc/cron.hourly/mcelog.cron". If you're having system stability issues (i.e. your computer freezes before the MCE logs get written to disk) you could delete the hourly cron job and instead add this job to "/etc/crontab" so that the MCE logs get written out every minute:
* * * * * root /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
This site has more good Linux info:
Here's what a corrected ECC error log entry might look like from "/var/log/mcelog":
Fri Oct 12 22:11:47 1492
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge MISC c008000101000000 ADDR afe00
Northbridge RAM Chipkill ECC error
Chipkill ECC syndrome = b542
bit46 = corrected ecc error
bit59 = misc error valid
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9c214000b5080813 MCGSTATUS 0
CPUID Vendor AMD Family 16 Model 4
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
DDR DIMM 1333 Mhz Synchronous Width 64 Data Width 64 Size 2 GB
Device Locator: DIMM3
Bank Locator: BANK3
Serial Number: SerNum03
Asset Tag: AssetTagNum3
Part Number: ModulePartNumber03
Note that this Linux mcelog reports the error in "DIMM3/BANK3" (this comes from SMBIOS data, which the log itself warns is often unreliable). It's important to know that this numbering starts at 0. So if you have four RAM slots they are numbered 0-3, i.e. "DIMM3/BANK3" is actually your 4th slot. Be careful when comparing this with the Windows log below: as far as I can tell, "Bank Number: 4" there is the machine check bank (the same 4 as in the "CPU 0 4 northbridge" line above, and the mcat output further down labels bank 4 as "North Bridge"), not a count of memory slots starting at 1.
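If the log grows, a few lines of Python can tally corrected errors per DIMM. This is only a sketch that assumes records look like the sample above: each record starts with the "HARDWARE ERROR" banner, corrected errors contain the text "corrected ecc error", and the slot is on a "Device Locator:" line. Other mcelog versions may format things differently, and the function name is made up for this example.

```python
import re
from collections import Counter


def count_corrected_by_dimm(log_text):
    """Tally corrected ECC errors per 'Device Locator' in mcelog-style text."""
    counts = Counter()
    # Assumption: every record begins with the "HARDWARE ERROR" banner.
    for record in log_text.split("HARDWARE ERROR")[1:]:
        if "corrected ecc error" not in record.lower():
            continue
        match = re.search(r"Device Locator:\s*(\S+)", record)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Trimmed copy of the sample record from this post, used as test input.
SAMPLE = """\
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge MISC c008000101000000 ADDR afe00
Northbridge RAM Chipkill ECC error
bit46 = corrected ecc error
STATUS 9c214000b5080813 MCGSTATUS 0
Device Locator: DIMM3
Bank Locator: BANK3
"""

if __name__ == "__main__":
    # For a real log, read the file instead, e.g. open("/var/log/mcelog").read()
    for dimm, hits in count_corrected_by_dimm(SAMPLE).most_common():
        print(dimm, hits)
```

A DIMM whose count keeps climbing while the others stay at zero is the one to look at first.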
In Windows you'll get a log like this in your system event log:
Log Name: System
Date: 10/12/1492 11:20:48 AM
Event ID: 19
Task Category: None
User: LOCAL SERVICE
A corrected hardware error occurred.
Error Source: Corrected Machine Check
Error Type: Bus/Interconnect Error
Processor ID Valid: Yes
Processor ID: 0x0
Bank Number: 4
Transaction Type: N/A
Processor Participation: Local node responded to the request
Request Type: Generic Read
Memory Hierarchy Level: Generic
Note how it even tells you which bank reported the error. If you start getting a lot of MCE logs that point at the same stick of RAM, that stick is probably going bad, so get it replaced.
In Windows you can use the "AMD MCAT Machine Check Analysis Tool" to get more details on the MCE logs.
Machine Check Analysis Tool (MCAT) is a command line utility that takes a Windows System Event Log (.evt) file as an argument and decodes the MCA Error logs into human readable format. MCAT can alternatively take as an argument the raw register hexadecimal values from an MCE Error.
When I used this on Vista it couldn't read the event log directly. However I could manually give it the numbers from the event log XML details for it to decode. I used commands like this:
mcat /cmd 4 0xf4782000e0080a13 0x90b2c700 0x2231eaa00000000
mcat 4 0x94744000eb080a13 0x30a4bc0 0x226dbd200000000
mcat 4 0x94214000b5080813 0x15ba7c00 0x225385200000000
Some example mcat output:
mcat 4 0x94214000b5080a13 0x3a1d8500 0x1be542027ef95c
Processor Number : 0
Bank Number : 4
Time Stamp (0x): 00000000 00000000
Error Status (0x): 94214000 B5080A13
Error Address (0x): 00000000 3A1D8500
Error Misc (0x): 001BE542 027EF95C
Status Bit Decode :
Correctable ECC error
Error address valid
Error Code (0x): 0A13
Error Type - Bus
Participation Processor (PP) - Local node responded to the request (RES)
Timeout (T) - Request did not time out
Memory Transaction Type (RRRR) - Generic read (RD)
Memory or IO (II) - Memory Access (MEM)
Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
ECC Error - DRAM ECC error detected in the NB.
Error address at 929 MB range
Syndrome (0x): B542
Data (0x): 1F
Bitmap (0x): 02
Error on bit(s) (dec): 125
Address decode: 000000003A1D8500
Node ID: 0
Channel Select: 1
Chip Select: 3
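If mcat isn't available, the main fields of the status value can be pulled apart by hand. The Python sketch below uses the bit positions as I understand them from AMD's BIOS and Kernel Developer's Guide for this CPU family (treat them as an assumption and check the guide for your processor); they do reproduce the mcat output above for 0x94214000b5080a13. The function and table names are made up for this example.

```python
# Field encodings for a "bus error" code of the form 0000 1PPT RRRR IILL.
PP = {0: "SRC (local node originated the request)",
      1: "RES (local node responded to the request)",
      2: "OBS (local node observed the error)",
      3: "GEN (generic)"}
RRRR = {0: "GEN", 1: "RD (generic read)", 2: "WR (generic write)",
        3: "DRD (data read)", 4: "DWR (data write)",
        5: "IRD (instruction fetch)", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
II = {0: "MEM", 1: "reserved", 2: "IO", 3: "GEN"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_nb_status(status):
    """Decode a 64-bit northbridge machine check status value (assumed layout)."""
    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF
    result = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "corrected_ecc": bit(46),
        "uncorrected_ecc": bit(45),
        # Syndrome: high byte in bits 31:24, low byte in bits 54:47.
        "syndrome": ((status >> 24) & 0xFF) << 8 | ((status >> 47) & 0xFF),
        "error_code": code,
    }
    if code >> 11 == 1:  # bus error format
        result["pp"] = PP[(code >> 9) & 0x3]
        result["timed_out"] = bool((code >> 8) & 1)
        result["rrrr"] = RRRR.get((code >> 4) & 0xF, "unknown")
        result["ii"] = II[(code >> 2) & 0x3]
        result["ll"] = LL[code & 0x3]
    return result


if __name__ == "__main__":
    for name, value in decode_nb_status(0x94214000B5080A13).items():
        shown = hex(value) if isinstance(value, int) and not isinstance(value, bool) else value
        print(f"{name:16} {shown}")
```

The three arguments after the bank number in the mcat commands above look like the status, address and misc register values, so the first of them is what you would feed to this function.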
Lastly, in Windows it's helpful to create a custom log view that shows serious errors. Then by using this custom log view you can quickly see if you had any serious errors logged. My custom view is set up like this:
- Logged: Any time
- Event logs: System
- Event Sources: BugCheck, eventlog, Eventlog, MemoryDiagnostics-Results, MemoryDiagnostics-Schedule, StartupRepair, WHEA-Logger, amdsata
- Includes Event IDs: 19,1001,6008,1002,1137,1208,1213,1101,1201,1103,11
Update March 2015, a few useful ECC pages I've come across lately:
- How to Check ECC RAM Functionality - Puget Custom Computers
- Advantages of ECC Memory - Puget Custom Computers
- ECC and REG ECC Memory Performance - Puget Custom Computers