Wednesday, January 27, 2010

I see see ECC

I wanted to create a well-written post about using ECC RAM, but I just don't have the time.  Therefore you're getting a condensed version that's low on wit and high on content (but not necessarily facts or truths).  Nothing I write here is original (except for any inaccuracies); it's all been plagiarized from the Internet.  Like I said before, I don't have the time to properly credit all of my sources, and I'm sure I don't remember them all either.

Why create this post?  Because I couldn't find a good website that showed how ECC should be configured for an above-average home ECC user like me.  I've found the answers to some of my questions and have taken educated guesses at what the other settings should be.  I hope this post can either help someone else out, or at least get someone to point out where I can find an easy-to-understand authoritative source on these settings.

Basically, ECC RAM is able to correct a single-bit memory error on the fly (and report double-bit memory errors).  In theory this gives you better system uptime (fewer BSODs).  The downside to ECC RAM is that you take a 0.5-2% performance hit (depending on the type of app you're running) and it costs more.
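
To make the "correct one bit, report two" idea concrete, here's a toy sketch in Python.  It is not what a real DIMM does: real ECC modules protect 64 data bits with 8 check bits, while this uses a tiny extended Hamming code (4 data bits plus 4 check bits) purely for illustration.  The function names encode and decode are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (a list of 0/1 values)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even parity: two bits flipped, can only report it.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                            # simulate one flipped bit
print(decode(word))                     # ('corrected', 11)
word[2] ^= 1                            # a second flip in the same word
print(decode(word))                     # ('uncorrectable', None)
```

The second flip is the case the rest of this post worries about: once two bits in the same word are wrong, all the code can do is raise its hand.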

According to this article the more ECC features you enable the more of a performance hit you take.  So if you're a gamer you might want to enable just the basic ECC checking.  Whether it costs a little or a lot more depends on timing and how savvy of a shopper you are.

More details on what ECC RAM is and the theory behind it are easily found via Google.  One interesting site is this summary of Google's 2.5-year study of DRAM error rates.

What do you need to run ECC RAM?  Well for starters you must have ECC RAM.  Most RAM is not ECC RAM.  ECC RAM is special.  In my "Gigabyte GA-790FXTA-UD5" motherboard I use this RAM:
CT2KIT25672BA1339 - 4GB kit (2GBx2), 240-pin DIMM , DDR3 PC3-10600 from Crucial.com
http://www.crucial.com/store/partspecs.aspx?IMODULE=CT2KIT25672BA1339
    * Module Size: 4GB kit (2GBx2)
    * Package: 240-pin DIMM
    * Feature: DDR3 1333 (PC3 10600)
    * Specs: DDR3 PC3-10600 • CL=9 • Unbuffered • ECC • DDR3-1333 • 1.5V • 256Meg x 72 

You also need a motherboard that supports ECC RAM.  AFAIK all (or maybe all non-laptop) AMD CPUs support ECC RAM (the ECC RAM controller is actually in the CPU), so every AMD motherboard should support ECC RAM.  Some motherboard manuals will show you all the ECC settings.  However, my GA-790FXTA-UD5 manual didn't even mention ECC RAM.  Its website did say that ECC was supported, so I took a gamble and bought this motherboard (which paid off).  My CPU is an "AMD Phenom II X4 905e Deneb 2.5GHz 4 x 512KB L2 Cache 6MB L3 Cache Socket AM3 65W Quad-Core Processor".

So now that you've got an AMD CPU, AMD motherboard and ECC RAM what do you need to do next?  Just slap it all together and turn your computer on and everything should work.  However it's not quite that simple.  Every motherboard I've seen has the ECC function disabled by default.

All (AMD) motherboard ECC settings should be similar.  My ECC settings are found under "MB Intelligent Tweaker (M.I.T.)->DRAM Configuration".  My (mainly plagiarized) definitions of these settings are:
  • DRAM ECC enable = Turn on ECC
  • DRAM MCE enable = Generate Machine Check Exception logs
  • Chip-Kill mode enable = Can correct some multi-bit errors.  Aka "4-Bit ECC"
  • DRAM ECC Redirection = When a single-bit ECC error is found, write the corrected data back to RAM (i.e. scrub just the location with the ECC error)
  • DRAM background scrubber = Interval between main memory scrubs (64 bytes)
  • L2 cache background scrubber = Interval between L2 cache scrubs (one single L2 cache line tag address)
  • DCache background scrubber = Interval between data cache scrubs (64 bits)
Below is a list of how I've configured my ECC settings.  Below each setting is a comment explaining why I made that choice.
  • DRAM ECC enable            Enabled
    • Why have ECC RAM if you're not going to use it?
  • DRAM MCE enable            Enabled
    • MCE logs let you keep track of how many ECC errors you have (i.e. how many times ECC silently corrected a single bit error and saved you from a crash)
  • Chip-Kill mode enable        Enabled
    • Correcting multi-bit errors sounds good to me.
  • DRAM ECC Redirection        Enabled
    • See my section below about scrubbing
  • DRAM background scrubber    10.49ms (scrub 4GB in ~187 hours)
    • See my section below about scrubbing
  • L2 cache background scrubber    Disabled
  • DCache background scrubber    Disabled
    • I don't scrub either of my caches since this PDF says it's not worth it.
Back to ECC 101.  When your CPU reads a section of RAM with a single-bit error, ECC automatically corrects it (and generates an MCE log to tell ya that it corrected it).  By default this corrected RAM value is only given to the CPU; it's not written back to RAM.  I.e. the bad RAM value is not changed!  By turning on "DRAM ECC Redirection" the good (corrected) value is written back to RAM (over the bad value) to correct the bad entry in RAM.  Why don't they do this by default?  I suppose it takes a microsecond longer to do.

The other reason you might not need to write back corrections on every read is that you can turn on scrubbing.  Scrubbing is basically a background process (normal ECC checks only the RAM values that your CPU has requested) that goes through all of your RAM, reading small 64-byte sections and checking them for ECC errors.  If it finds an error it corrects the error and writes the corrected value back to RAM.

The main reason to scrub your RAM is to avoid getting two-bit errors.  Say a single cosmic ray flips one of your bits.  If you leave your computer on 24x7 and you don't have "DRAM ECC Redirection" enabled, you will eventually have another cosmic ray flip another bit in the same "ECC area" as your first bit flip.  ECC can't fix a two-bit error, only report it (and freeze your computer).  In summary, scrubbing greatly decreases the odds that you'll get a two-bit ECC error.

Note that with Windows Vista and Windows 7 by default your computer sleeps when you shut it down.  In sleep mode your RAM is still powered and therefore still capable of taking errors.  I.e. as far as your RAM is concerned, if you use sleep mode your computer is on 24x7 and you should be scrubbing your RAM.  If you truly power off your computer or reboot it regularly (like Windows Update is known to do) you probably don't need to scrub your RAM.

I configure my system to scrub 64 bytes of RAM (the amount scrubbed is not configurable) every 10.49ms.  This will scrub my entire 4GB of RAM in ~187 hours.  That may seem like a long time, but you need to remember that ECC errors are pretty rare anyway.  Since you really only need to be scared of two-bit errors, the odds of getting any of those are pretty slim.  Additionally, since I have "DRAM ECC Redirection" enabled I'm in effect running another scrubber.
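
As a rough sanity check on that scrub time, here's a minimal Python sketch of the arithmetic.  It assumes the scrubber handles exactly one 64-byte block per interval and nothing else; the function name full_scrub_hours is my own invention.  The answer shifts a bit depending on whether "4GB" means 4 x 10^9 bytes or 4 x 2^30 bytes, and the ~187 hours quoted above falls between the two.

```python
def full_scrub_hours(ram_bytes, interval_ms=10.49, block_bytes=64):
    """Hours needed to walk all of RAM at one block per scrub interval."""
    blocks = ram_bytes / block_bytes
    return blocks * interval_ms / 1000 / 3600


print(f"{full_scrub_hours(4 * 10**9):.1f} hours")   # 4GB as 4 x 10^9 bytes -> 182.1 hours
print(f"{full_scrub_hours(4 * 2**30):.1f} hours")   # 4GB as 4 x 2^30 bytes -> 195.5 hours
```

Plug in your own RAM size and BIOS interval to see how long a full pass would take on your box.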

Another random source I found useful is the "AMD Hammer Family Processor BIOS and Kernel Developer's Guide."  It's not the easiest read, but there's some good info in there.


Now you know how/why I configure my ECC RAM to work for me.  I'm not an authoritative source on ECC RAM.  My settings may not be the best for you.  They may not even be the best for me!   With that dire warning aside, there is one more thing I should mention.

How do you monitor your ECC RAM?  How do you know that it's actually working and correcting errors?  Via the magic of MCE (Machine Check Exception) logs of course!

In Linux the MCE logs are in "/var/log/mcelog".  Fedora 12 (64-bit) writes the MCE logs to disk once an hour via the cron job "/etc/cron.hourly/mcelog.cron".  If you're having system stability issues (i.e. your computer freezes before the MCE logs get written to disk) you could delete the hourly cron job and instead add this job to "/etc/crontab" so that the MCE logs get written out every minute:
* * * * * root /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
This site has more good Linux info:

Here's what a corrected ECC error log entry might look like in "/var/log/mcelog":
MCE 0
Fri Oct 12 22:11:47 1492
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge MISC c008000101000000 ADDR afe00
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = b542
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node origin, request didn't time out
             generic read mem transaction
             memory access, level generic'
STATUS 9c214000b5080813 MCGSTATUS 0
CPUID Vendor AMD Family 16 Model 4
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
DDR DIMM 1333 Mhz Synchronous Width 64 Data Width 64 Size 2 GB
Device Locator: DIMM3
Bank Locator: BANK3
Manufacturer: Manufacturer03
Serial Number: SerNum03
Asset Tag: AssetTagNum3
Part Number: ModulePartNumber03

Note that this Linux mcelog reports the error in "DIMM3/BANK3".  It's important to know that Linux starts counting your memory banks at 0.  So if you have four RAM slots Linux numbers them from 0-3.  I.e. "DIMM3/BANK3" is actually your 4th memory bank.  As you'll see below Windows (unlike Linux) starts counting at 1, so this same error in Windows would be reported as coming from "Bank Number: 4".
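
If you'd rather not eyeball the log, here's a small Python sketch that tallies entries per DIMM and converts the zero-based label to a one-based slot number.  It assumes every mcelog record carries a "Device Locator:" line like the one above (and remember the log's own warning that the SMBIOS data may be unreliable).  The function names are made up for this example.

```python
import re
from collections import Counter


def count_by_dimm(log_text):
    """Count mcelog records per 'Device Locator' value (e.g. 'DIMM3')."""
    locators = re.findall(r"^Device Locator:\s*(\S+)", log_text, flags=re.MULTILINE)
    return Counter(locators)


def one_based_slot(locator):
    """'DIMM3' (counted from 0) -> 4 (counted from 1)."""
    return int(re.search(r"(\d+)$", locator).group(1)) + 1


# Excerpt of the log entry shown above; in practice you might read the
# text with open("/var/log/mcelog").read() instead.
sample = """\
MCE 0
CPU 0 4 northbridge MISC c008000101000000 ADDR afe00
  Northbridge RAM Chipkill ECC error
Device Locator: DIMM3
Bank Locator: BANK3
"""

for locator, hits in count_by_dimm(sample).items():
    print(locator, "-> slot", one_based_slot(locator), "-", hits, "logged error(s)")
```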

In Windows you'll get a log like this in your system event log:
Log Name:      System
Source:        Microsoft-Windows-WHEA-Logger
Date:          10/12/1492 11:20:48 AM
Event ID:      19
Task Category: None
Level:         Warning
Keywords:      
User:          LOCAL SERVICE
Computer:      hal
Description:
A corrected hardware error occurred.  
Error Source: Corrected Machine Check
Error Type: Bus/Interconnect Error
Processor ID Valid: Yes
Processor ID: 0x0
Bank Number: 4
Transaction Type: N/A
Processor Participation: Local node responded to the request
Request Type: Generic Read
Memory/Io: Memory
Memory Hierarchy Level: Generic
Timeout: No

Note how it's even nice enough to tell you which memory bank had the error?  If you start getting a lot of MCE logs from one bank, maybe you simply have a stick of RAM going bad.  Get it replaced.

In Windows you can use the "AMD MCAT Machine Check Analysis Tool" to get more details on the MCE logs.
Machine Check Analysis Tool (MCAT) is a command line utility that takes a Windows System Event Log (.evt) file as an argument and decodes the MCA Error logs into human readable format. MCAT can alternatively take as an argument the raw register hexadecimal values from an MCE Error.

When I used this on Vista it couldn't read the event log directly. However I could manually give it the numbers from the event log XML details for it to decode. I used commands like this:
mcat /cmd 4 0xf4782000e0080a13 0x90b2c700 0x2231eaa00000000
mcat 4 0x94744000eb080a13 0x30a4bc0 0x226dbd200000000
mcat 4 0x94214000b5080813 0x15ba7c00 0x225385200000000

Some example mcat output:
mcat 4 0x94214000b5080a13 0x3a1d8500 0x1be542027ef95c
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): 94214000 B5080A13
Error Address (0x): 00000000 3A1D8500
Error Misc    (0x): 001BE542 027EF95C
Status Bit Decode :
   Correctable ECC error
   Error address valid
   Error enable
   Error valid
Error Code    (0x): 0A13
   Error Type - Bus
   Participation Processor (PP) - Local node responded to the request (RES)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 929 MB rage
   Syndrome  (0x): B542
      Data   (0x): 1F
      Bitmap (0x): 02
      Error on bit(s) (dec): 125 
   Address decode: 000000003A1D8500
      Node ID: 0
      Channel Select: 1
      Chip Select: 3
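
If you're curious what a decoder like this is doing with the status number, here's a rough Python sketch.  The bit positions are my assumption of the northbridge (bank 4) status layout from AMD's BIOS and Kernel Developer's Guide; I only checked them against the sample values in this post, so treat it as an illustration and not a replacement for mcat.  The function names are made up.

```python
# Assumed flag bit positions in the 64-bit status word.
FLAG_BITS = {
    "valid": 63, "overflow": 62, "uncorrected": 61, "enabled": 60,
    "misc_valid": 59, "addr_valid": 58,
    "corrected_ecc": 46, "uncorrected_ecc": 45,
}


def decode_status(status):
    """Pull the flags, ECC syndrome and 16-bit error code out of a status word."""
    result = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    # The syndrome is split in two: high byte in bits 31:24, low byte in bits 54:47.
    result["syndrome"] = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    result["error_code"] = status & 0xFFFF
    return result


def decode_bus_error(code):
    """Decode a bus error code of the form 0000 1PPT RRRR IILL."""
    requests = {0: "GEN", 1: "RD", 2: "WR"}   # only the first few request types
    rrrr = (code >> 4) & 0xF
    return {
        "participation": ["SRC", "RES", "OBS", "GEN"][(code >> 9) & 3],
        "timeout": bool((code >> 8) & 1),
        "request": requests.get(rrrr, hex(rrrr)),
        "mem_or_io": ["MEM", "reserved", "IO", "GEN"][(code >> 2) & 3],
        "level": ["L0", "L1", "L2", "LG"][code & 3],
    }


info = decode_status(0x94214000B5080A13)
print(hex(info["syndrome"]))                  # 0xb542
print(decode_bus_error(info["error_code"]))
```

For the status value from the mcat example this gives syndrome B542 and a corrected ECC error with a valid address, the same as the tool's output.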



Lastly, in Windows it's helpful to create a custom log view that shows serious errors. Then by using this custom log view you can quickly see if you had any serious errors logged. My custom view is set up like this:
  • Logged: Any time
  • Event logs: System
  • Event Sources:  BugCheck, eventlog, Eventlog, MemoryDiagnostics-Results, MemoryDiagnostics-Schedule, StartupRepair, WHEA-Logger, amdsata
  • Includes Event IDs: 19,1001,6008,1002,1137,1208,1213,1101,1201,1103,11


March 2015 update, a few useful ECC pages I've come across lately:

30JAN2019 update.  I'm now running CentOS 7.6 on my same ancient hardware.  With CentOS 7 mcelog no longer supports my AMD CPU.  Per this article the mcelog package isn't used by AMD processors.  If you have mcelog installed with an AMD processor you'll get messages like these:
    grep mce /var/log/messages
    mcelog: ERROR: AMD Processor family 16: mcelog does not support this processor.  Please use the edac_mce_amd module instead.#012: Success
    : CPU is unsupported
The easiest way to get rid of these messages is to simply remove the mcelog package:
yum erase mcelog

And install this instead:
yum install edac-utils

To diagnose ECC errors with EDAC run these commands:
edac-util
edac-util --report=full
grep "\[Hardware Error\]\|EDAC" /var/log/messages



Some sample output below from my PC shows my DIMM in row 1 is giving me the CE (correctable errors).  Since they were corrected by my ECC RAM they didn't cause a problem.  I was able to replace my bad stick of RAM before things got bad (i.e. before I had any UE/Uncorrectable Errors).

Note that EDAC starts counting at 0, so csrow#1 actually corresponds to what my motherboard has labeled as DIMM slot 2.

Another thing to note is that the [Hardware Error] entries in /var/log/messages are also printed to any logged-in terminal session (which is how I first became aware of this issue).
 
[jim@c64 ~]# edac-util
mc0: csrow1: mc#0csrow#1channel#1: 1 Corrected Errors

[jim@c64 ~]$ edac-util --report=full
mc0:csrow0:mc#0csrow#0channel#0:CE:0
mc0:csrow0:mc#0csrow#0channel#1:CE:0
mc0:csrow1:mc#0csrow#1channel#0:CE:0
mc0: csrow1: mc#0csrow#1channel#1: 1 Corrected Errors
mc0:csrow1:mc#0csrow#1channel#1:CE:1
mc0:csrow2:mc#0csrow#2channel#0:CE:0
mc0:csrow2:mc#0csrow#2channel#1:CE:0
mc0:csrow3:mc#0csrow#3channel#0:CE:0
mc0:csrow3:mc#0csrow#3channel#1:CE:0

[root@c64 ~]# grep "\[Hardware Error\]\|EDAC" /var/log/messages
Jan 30 05:16:59 c64 kernel: EDAC MC: Ver: 3.0.0
Jan 30 11:17:05 c64 kernel: AMD64 EDAC driver v3.4.0
Jan 30 11:17:05 c64 kernel: EDAC amd64: DRAM ECC enabled.
Jan 30 11:17:05 c64 kernel: EDAC amd64: F10h detected (node 0).
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 0:  2048MB 1:  2048MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 2:  2048MB 3:  2048MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 4:     0MB 5:     0MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 6:     0MB 7:     0MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 0:  2048MB 1:  2048MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 2:  2048MB 3:  2048MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 4:     0MB 5:     0MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: MC: 6:     0MB 7:     0MB
Jan 30 11:17:05 c64 kernel: EDAC amd64: using x4 syndromes.
Jan 30 11:17:05 c64 kernel: EDAC amd64: MCT channel count: 2
Jan 30 11:17:05 c64 kernel: EDAC MC0: Giving out device to 'amd64_edac' 'F10h': DEV 0000:00:18.3
Jan 30 11:17:05 c64 kernel: EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
Jan 30 11:21:58 c64 kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 11:21:58 c64 kernel: [Hardware Error]: Corrected error, no action required.
Jan 30 11:21:58 c64 kernel: [Hardware Error]: CPU:0 (10:4:3) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c00c00041080813
Jan 30 11:21:58 c64 kernel: [Hardware Error]: Error Addr: 0x000000037ceb8980
Jan 30 11:21:58 c64 kernel: [Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Jan 30 11:21:58 c64 kernel: EDAC MC0: 1 CE on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x37ceb8 offset:0x980 grain:0 syndrome:0x4101)
Jan 30 11:21:58 c64 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
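
To pick the interesting lines out of a long "edac-util --report=full" listing, something like the Python sketch below could work.  It assumes the colon-separated line format shown in my output above (I've only ever seen CE lines on my box, so the UE match is a guess), and the names nonzero_counts and edac_address are invented for this example.  The second function assumes 4KB pages, which matches the page, offset and "Error Addr" values in my log.

```python
import re

# Matches lines such as: mc0:csrow1:mc#0csrow#1channel#1:CE:1
LINE = re.compile(r"^mc(\d+):csrow(\d+):\S*channel#(\d+):(CE|UE):(\d+)$")


def nonzero_counts(report):
    """Return (mc, csrow, channel, kind, count) for every non-zero counter."""
    hits = []
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match and int(match.group(5)) > 0:
            mc, csrow, channel, kind, count = match.groups()
            hits.append((int(mc), int(csrow), int(channel), kind, int(count)))
    return hits


def edac_address(page, offset):
    """Rebuild the physical address from EDAC's page and offset (4KB pages assumed)."""
    return (page << 12) | offset


report = """\
mc0:csrow0:mc#0csrow#0channel#0:CE:0
mc0:csrow1:mc#0csrow#1channel#0:CE:0
mc0: csrow1: mc#0csrow#1channel#1: 1 Corrected Errors
mc0:csrow1:mc#0csrow#1channel#1:CE:1
"""

for mc, csrow, channel, kind, count in nonzero_counts(report):
    # csrow numbering starts at 0; on my board csrow 1 is DIMM slot 2.
    print(f"mc{mc} csrow{csrow} channel{channel}: {count} {kind}")

print(hex(edac_address(0x37ceb8, 0x980)))     # 0x37ceb8980
```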