Wednesday, January 27, 2010

I see see ECC

I wanted to create a well-written post about using ECC RAM, but I just don't have the time.  Therefore you're getting a condensed version that's low on wit and high on content (but not necessarily facts or truths).  Nothing I write here is original (except for any inaccuracies); it's all been plagiarized from the Internet.  Like I said, I don't have the time to properly credit all of my sources, and I'm sure I don't remember them all either.

Why create this post?  Because I couldn't find a good website that showed how ECC should be configured for an above average home ECC user like me.  I've found the answers to some of my questions and have taken educated guesses at what other settings should be.  I hope this post can either help someone else out, or at least get someone to point out where I can find an easy to understand authoritative source on these settings.

Basically ECC RAM is able to correct a single-bit memory error on the fly (and report double-bit memory errors).  In theory this gives you better system uptime (fewer BSODs).  The downside to ECC RAM is that you take a 0.5-2% performance hit (depending on the type of app you're running) and it costs more.
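
To make the "correct one bit, detect two" idea concrete, here's a toy Python sketch using an extended Hamming (8,4) code.  It's only an illustration of the principle: real ECC DIMMs protect 64 data bits with 8 check bits in hardware, and the function names below are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6 and 7.
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit so that the whole word has even parity.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: treat as a single-bit error at position `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped) and fix it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped, can only report it.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                 # one "cosmic ray"
print(decode(word))          # single-bit error: data comes back intact
word[2] ^= 1                 # a second flip in the same word
print(decode(word))          # double-bit error: detected but not fixable
```

The second flip is exactly the case that scrubbing (covered further down) is meant to prevent.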

According to this article the more ECC features you enable the more of a performance hit you take.  So if you're a gamer you might want to enable just the basic ECC checking.  Whether it costs a little or a lot more depends on timing and how savvy of a shopper you are.

More details on what ECC RAM is and the theory behind it are easily found via Google.  One interesting site is this summary of Google's 2.5-year study of DRAM error rates.

What do you need to run ECC RAM?  Well for starters you must have ECC RAM.  Most RAM is not ECC RAM.  ECC RAM is special.  In my "Gigabyte GA-790FXTA-UD5" motherboard I use this RAM:
CT2KIT25672BA1339 - 4GB kit (2GBx2), 240-pin DIMM , DDR3 PC3-10600 from Crucial.com
http://www.crucial.com/store/partspecs.aspx?IMODULE=CT2KIT25672BA1339
    * Module Size: 4GB kit (2GBx2)
    * Package: 240-pin DIMM
    * Feature: DDR3 1333 (PC3 10600)
    * Specs: DDR3 PC3-10600 • CL=9 • Unbuffered • ECC • DDR3-1333 • 1.5V • 256Meg x 72 

You also need a motherboard that supports ECC RAM.  AFAIK all (or maybe all non-laptop) AMD CPUs support ECC RAM (the ECC RAM controller is actually in the CPU), so every AMD motherboard should support ECC RAM.  Some motherboard manuals will show you all the ECC settings.  However my GA-790FXTA-UD5 manual didn't even mention ECC RAM.  Its website did say that ECC was supported, so I took a gamble and bought this motherboard (which paid off).  My CPU is an "AMD Phenom II X4 905e Deneb 2.5GHz 4 x 512KB L2 Cache 6MB L3 Cache Socket AM3 65W Quad-Core Processor".

So now that you've got an AMD CPU, AMD motherboard and ECC RAM what do you need to do next?  Just slap it all together and turn your computer on and everything should work.  However it's not quite that simple.  Every motherboard I've seen has the ECC function disabled by default.

All (AMD) motherboard ECC settings should be similar.  My ECC settings are found under "MB Intelligent Tweaker (M.I.T.)->DRAM Configuration".  My (mainly plagiarized) definitions of these settings are:
  • DRAM ECC enable = Turn on ECC
  • DRAM MCE enable = Generate Machine Check Exception logs
  • Chip-Kill mode enable = Can correct some multi-bit errors.  Aka "4-Bit ECC"
  • DRAM ECC Redirection = When a single-bit ECC error is found, write the corrected data back to RAM (i.e. scrub just the location with the ECC error)
  • DRAM background scrubber = Interval between main memory scrubs (64 bytes)
  • L2 cache background scrubber = Interval between L2 cache scrubs (one single L2 cache line tag address)
  • DCache background scrubber = Interval between data cache scrubs (64 bits)
Below is a list of how I've configured my ECC settings.  Below each setting is a comment on why I made that choice.
  • DRAM ECC enable            Enabled
    • Why have ECC RAM if you're not going to use it?
  • DRAM MCE enable            Enabled
    • MCE logs let you keep track of how many ECC errors you have (i.e. how many times ECC silently corrected a single bit error and saved you from a crash)
  • Chip-Kill mode enable        Enabled
    • Correcting multi-bit errors sounds good to me.
  • DRAM ECC Redirection        Enabled
    • See my section below about scrubbing
  • DRAM background scrubber    10.49ms (scrub 4GB in ~187 hours)
    • See my section below about scrubbing
  • L2 cache background scrubber    Disabled
  • DCache background scrubber    Disabled
    • I don't scrub either of my caches since this PDF says it's not worth it.
Back to ECC 101.  When your CPU reads a section of RAM with a single-bit error, ECC automatically corrects it (and generates an MCE log to tell ya that it corrected it).  By default this corrected RAM value is only given to the CPU; it's not written back to RAM.  I.e. the bad RAM value is not changed!  By turning on "DRAM ECC Redirection" the good (corrected) value is written back to RAM (over the bad value) to correct the bad entry in RAM.  Why don't they do this by default?  I suppose it takes a microsecond longer to do.

The other reason redirection isn't strictly necessary is that you can turn on scrubbing.  Scrubbing is basically a background process (normal ECC checks only the RAM values that your CPU has requested) that goes through all of your RAM, reading small 64-byte sections and checking them for ECC errors.  If it finds an error it corrects the error and writes the corrected value back to RAM.

The main reason to scrub your RAM is to avoid getting two-bit errors.  I.e. a single cosmic ray flips one of your bits.  If you leave your computer on 24x7 and you don't have "DRAM ECC Redirection" enabled you will eventually have another cosmic ray flip another bit in the same "ECC area" as your first bit flip.  With a two-bit error ECC can't fix it, only report it (and freeze your computer).  In summary, scrubbing greatly decreases the odds that you'll get a two-bit ECC error.

Note that with Windows Vista and Windows 7 by default your computer sleeps when you shut it down.  In sleep mode your RAM is still powered and therefore still capable of taking errors.  I.e. as far as your RAM is concerned, if you use sleep mode your computer is on 24x7 and you should be scrubbing your RAM.  If you truly power off your computer or reboot it regularly (like Windows Update is known to do) you probably don't need to scrub your RAM.

I configure my system to scrub 64 bytes (the amount of RAM scrubbed is not configurable) of RAM every 10.49ms.  This will scrub my entire 4GB of RAM in ~187 hours.  That may seem like a long time, but you need to remember that ECC errors are pretty rare anyway.  Since you really only need to be scared of two-bit errors, the odds of getting any two-bit errors are pretty slim.  Additionally since I have "DRAM ECC Redirection" enabled I'm in effect running another scrubber.
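
The scrub-time arithmetic is easy to check with a few lines of Python.  This is just a sketch: the function name is made up, and it assumes the scrubber handles exactly one 64-byte block per interval with no other overhead.  With 4GB counted as 2^32 bytes the full pass comes out nearer 195 hours; the ~187 figure matches if 4GB is counted as 4096 x 10^6 bytes.

```python
def full_scrub_hours(ram_bytes, interval_ms, block_bytes=64):
    """Hours needed to scrub all of RAM when one block is scrubbed per interval."""
    blocks = ram_bytes / block_bytes
    return blocks * interval_ms / 1000 / 3600


GIB = 2**30

# 4 GB counted in binary units (2**32 bytes)
print(round(full_scrub_hours(4 * GIB, 10.49), 1))

# 4096 "MB" counted as 10**6 bytes each
print(round(full_scrub_hours(4096 * 10**6, 10.49), 1))
```

Either way the answer is "roughly eight days", which is all the precision this setting needs.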

Another random source I found useful is the "AMD Hammer Family Processor BIOS and Kernel Developer's Guide."  It's not the easiest read, but there's some good info in there.


Now you know how/why I configure my ECC RAM to work for me.  I'm not an authoritative source on ECC RAM.  My settings may not be the best for you.  They may not even be the best for me!   With that dire warning aside, there is one more thing I should mention.

How do you monitor your ECC RAM?  How do you know that it's actually working and correcting errors?  Via the magic of MCE (Machine Check Exception) logs of course!

In Linux the MCE logs are in "/var/log/mcelog".  Fedora 12 (64 bit) writes the MCE logs to disk once an hour via the cron job "/etc/cron.hourly/mcelog.cron".  If you're having system stability issues (i.e. your computer freezes before the MCE logs get written to disk) you could delete the hourly cron job and instead add this job to "/etc/crontab" so that the MCE logs get written out every minute:
* * * * * root /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
This site has more good Linux info:

Here's what a corrected ECC error log entry might look like from "/var/log/mcelog":
MCE 0
Fri Oct 12 22:11:47 1492
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge MISC c008000101000000 ADDR afe00
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = b542
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node origin, request didn't time out
             generic read mem transaction
             memory access, level generic'
STATUS 9c214000b5080813 MCGSTATUS 0
CPUID Vendor AMD Family 16 Model 4
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
DDR DIMM 1333 Mhz Synchronous Width 64 Data Width 64 Size 2 GB
Device Locator: DIMM3
Bank Locator: BANK3
Manufacturer: Manufacturer03
Serial Number: SerNum03
Asset Tag: AssetTagNum3
Part Number: ModulePartNumber03
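
If the log grows, a small script can tally corrected errors per DIMM for you.  The Python below is a rough sketch that assumes every record looks like the sample above (starts with an "MCE <number>" line, contains a "corrected ecc error" line, and has a "Device Locator:" line); the function name is made up, and other mcelog versions may format things differently.

```python
import re
from collections import Counter


def count_corrected_by_dimm(log_text):
    """Count corrected ECC records per 'Device Locator' value in mcelog text."""
    counts = Counter()
    # Records are assumed to start with a line such as "MCE 0".
    records = re.split(r"(?m)^MCE \d+\s*$", log_text)
    for record in records:
        if "corrected ecc error" not in record.lower():
            continue
        match = re.search(r"(?m)^Device Locator:\s*(\S+)", record)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Abbreviated copy of the sample entry above, used here as test input.
sample = """MCE 0
CPU 0 4 northbridge MISC c008000101000000 ADDR afe00
  Northbridge RAM Chipkill ECC error
       bit46 = corrected ecc error
STATUS 9c214000b5080813 MCGSTATUS 0
Device Locator: DIMM3
Bank Locator: BANK3
"""

print(count_corrected_by_dimm(sample))
```

On a real system you would pass in the contents of "/var/log/mcelog" instead of the sample string.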

Note that this Linux mcelog reports the error in "DIMM3/BANK3".  It's important to know that Linux starts counting your memory banks at 0.  So if you have four RAM slots Linux numbers them from 0-3.  I.e. "DIMM3/BANK3" is actually your 4th memory bank.  Don't confuse this with the "4" in "CPU 0 4 northbridge": that is the machine check bank that reported the error (bank 4 is the northbridge, where the memory controller lives), not a RAM slot.  As you'll see below, the "Bank Number: 4" in the Windows log is that same machine check bank.

In Windows you'll get a log like this in your system event log:
Log Name:      System
Source:        Microsoft-Windows-WHEA-Logger
Date:          10/12/1492 11:20:48 AM
Event ID:      19
Task Category: None
Level:         Warning
Keywords:      
User:          LOCAL SERVICE
Computer:      hal
Description:
A corrected hardware error occurred.  
Error Source: Corrected Machine Check
Error Type: Bus/Interconnect Error
Processor ID Valid: Yes
Processor ID: 0x0
Bank Number: 4
Transaction Type: N/A
Processor Participation: Local node responded to the request
Request Type: Generic Read
Memory/Io: Memory
Memory Hierarchy Level: Generic
Timeout: No

Note that "Bank Number: 4" here is the machine check bank (the northbridge), so unlike the Linux log it doesn't tell you which stick of RAM had the error.  To narrow that down you need to decode the error address (see mcat below).  If you start getting a lot of MCE logs from one stick, maybe you simply have a stick of RAM going bad; get it replaced.

In Windows you can use the "AMD MCAT Machine Check Analysis Tool" to get more details on the MCE logs.
Machine Check Analysis Tool (MCAT) is a command line utility that takes a Windows System Event Log (.evt) file as an argument and decodes the MCA Error logs into human readable format. MCAT can alternatively take as an argument the raw register hexadecimal values from an MCE Error.

When I used this on Vista it couldn't read the event log directly. However I could manually give it the numbers from the event log XML details for it to decode. I used commands like this:
mcat /cmd 4 0xf4782000e0080a13 0x90b2c700 0x2231eaa00000000
mcat 4 0x94744000eb080a13 0x30a4bc0 0x226dbd200000000
mcat 4 0x94214000b5080813 0x15ba7c00 0x225385200000000

Some example mcat output:
mcat 4 0x94214000b5080a13 0x3a1d8500 0x1be542027ef95c
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): 94214000 B5080A13
Error Address (0x): 00000000 3A1D8500
Error Misc    (0x): 001BE542 027EF95C
Status Bit Decode :
   Correctable ECC error
   Error address valid
   Error enable
   Error valid
Error Code    (0x): 0A13
   Error Type - Bus
   Participation Processor (PP) - Local node responded to the request (RES)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 929 MB rage
   Syndrome  (0x): B542
      Data   (0x): 1F
      Bitmap (0x): 02
      Error on bit(s) (dec): 125 
   Address decode: 000000003A1D8500
      Node ID: 0
      Channel Select: 1
      Chip Select: 3



Lastly, in Windows it's helpful to create a custom log view that shows serious errors. Then by using this custom log view you can quickly see if you had any serious errors logged. My custom view is set up like this:
  • Logged: Any time
  • Event logs: System
  • Event Sources:  BugCheck, eventlog, Eventlog, MemoryDiagnostics-Results, MemoryDiagnostics-Schedule, StartupRepair, WHEA-Logger, amdsata
  • Includes Event IDs: 19,1001,6008,1002,1137,1208,1213,1101,1201,1103,11


Update March 2015, a few useful ECC pages I've come across lately:

12 comments:

Anonymous said...

Don't know if you'll ever read this comment but I just want to say thanks for providing concrete information on a consumer mobo's ECC capability.

Going by the BIOS screenshots in their manuals, I too doubted Gigabyte's claim of ECC support. I was almost ready to go for an Asus board (because they do show ECC options in their manuals), but Asus seem to be letting quality slip judging by all the negative product reviews they've been getting on newegg, etc.

Seems like unless you're willing to spend serious money on a server board, being sure of ECC support in a regular desktop mobo is a bit of a gamble.

Anyway, thanks again. I can finally replace this old-but-reliable dual Athlon/ECC system from 2002 :D

Jim said...

Glad I could help. Funny you should mention Asus. I actually tried an Asus board before getting this one. My last post before this one chronicles my fight with stability issues with that Asus board. I haven't had a single issue since going Gigabyte.

It is strange that Gigabyte doesn't provide more details on their ECC support. It's almost like they're trying to hide it.

It sounds like you upgrade your desktop about as often as I do. This new build replaces my ancient 1.4GHz Athlon system from 2001. Needless to say, the speed improvement is quite noticeable. :)

Anonymous said...

"/var/log/mcelog" is typically created by the user space backend mcelog.

So, if it's missing in your default installation you can get it! For example in gentoo you need app-admin/mcelog and a suitable kernel.

Anonymous said...

Thanks for the post! I am looking at this same motherboard.

I'd like to put 8Gb of RAM in there, preferably ECC. However the qualified vendor list doesn't mention any ECC RAM at all, much less 8GB RAM (even a 4x2Gb). So it's a crapshoot! :-) I see you put in Crucial probably 2X2Gb?

Here is the link to the QVL:
http://www.gigabyte.us/FileList/MemorySupport/mb_memory_ga-790fxta-ud5.pdf

Anonymous said...

So in your opinion, if you turn the computer off (completely off) on a daily basis, is ECC RAM even needed? You can get higher speed DDR 1600 non-ECC RAM for this motherboard, which is on the QVL and potentially cheaper. Your thoughts?

Btw, that ECC RAM from Crucial is $129 at Provantage.com. That's not too bad. I have a question into Crucial whether it will work as two kits for 8Gb total.

Anonymous said...

I posted the last two anonymous posts. Btw, the non-ECC RAM I've been looking at is from G.SKILL. Specs are:

G.SKILL Ripjaws Series 8GB (4 x 2GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) Desktop Memory Model F3-12800CL9Q-8GBRL CL 9-9-9-24-2N, Not ECC.

G-Skill says (a) any CL8 or CL9 DDR3 1600 RAM will work with this motherboard, or use any DDR1330; (b) that the 4X2Gb are tested to work together as opposed to two sets of 2X2Gb (hence my concern about buying two kits of 2x2Gb ECC RAM); (c) that 8Gb could certainly be utilized by Hyper-V and other virtualization products; and (d) "If you have an AMD Phenom II 955 or 965 C3 revision, then you should not have a problem with 4 modules at DDR3-1600. Otherwise, you may need to downclock to DDR3-1333 or use two 4GB modules."

Jim said...

Whether or not you need ECC RAM is a subject of much debate (which you can find with Google). I am not an authoritative source, but since you asked for my two cents here they are.

IMHO you don't need ECC RAM if you turn your PC off every night. You don't really need it even if your PC is on 24x7 like mine is. I haven't taken a single ECC error with this RAM and motherboard and I've had it running 24x7 for a few months now.

Having said that, I'm obviously a fan of ECC RAM and still plan on using it for all my computers. It just depends on your level of paranoia. If your goal is maximum speed, you don't want ECC RAM. If your goal is maximum stability, then you want ECC RAM. If your goal is anywhere in between, then the answer is less clear cut. I value stability over speed (I don't play games on my PC), hence my ECC RAM.

Regarding the RAM QVL, that's always a concern. However, I did have to open a ticket with Gigabyte regarding a resume from sleep issue and an ECC setting. Gigabyte never made an issue about my non-QVL RAM. They even sent me a custom BIOS to resolve the issue (the fix is also included in the public f3d BIOS release). So my experience with non-QVL RAM has been good, but YMMV.

I've also been running 8GB of my ECC RAM (4x2GB) for about a month now without any issues. Regarding 4 sticks at high speed, I'd follow whatever G-Skill says. Not an issue for me since my RAM is slower.

bonso said...

Hi there,
A good write up. I'm curious about where you found information about how to set the DRAM background scrub timer? Reading your post I think I've set mine way too high, 84ms for 8GB worth of RAM, unless I keep the machine running for ages...
Thanks!

Jim said...

I set my time based on what I think my ECC error rate will be. If you know that then setting your scrub time is actually very easy. For example if I know that my error rate per 4GB of RAM is one error every 200 hours, then I probably want my entire RAM to be scrubbed every 200 hours.

So calculating your scrub rate should also be easy; all we need to know is what your error rate is. Before we get to your error rate, let's determine how long it takes to scrub your 8GB of RAM. By my math a 10.49ms scrub time scrubs 4GB of RAM in ~187 hours. Therefore 8GB at 10.49ms would get done in roughly 374 hours. Since you're scrubbing 8 times slower (84/10.49 = 8), 8*374 = 2,992 hours for you to scrub all 8GB of your RAM. 2,992 hours = 125 days or a little over 4 months. Therefore if your 8GB of RAM averages more than 4 months between errors, you've chosen a good scrub rate.

See, that wasn't hard at all. Except for the math. Math is hard. You probably want to double check my arithmetic.
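
If you'd rather start from the other end (pick how often you want all of your RAM scrubbed and work out the interval), the same arithmetic runs backwards.  A quick Python sketch, with a made-up function name and the same assumption of one 64-byte block per interval; your BIOS will only offer a fixed list of intervals, so pick the nearest one at or below the result.

```python
def scrub_interval_ms(ram_bytes, target_hours, block_bytes=64):
    """Interval (in ms) per block that scrubs all of RAM once every target_hours."""
    blocks = ram_bytes / block_bytes
    return target_hours * 3600 * 1000 / blocks


GIB = 2**30

# Example: 8 GB of RAM, one full pass every 200 hours
print(round(scrub_interval_ms(8 * GIB, 200), 2))
```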

Now all we need to know is your ECC error rate. Fortunately Google gives us a nice concise range of estimates. For example from Wikipedia "Recent tests {[7],[4],[5]} give widely varying error rates with over 7 orders of magnitude difference, ranging from 10^-10 to 10^-17 error/bit·h, roughly one bit error, per hour, per gigabyte of memory to one bit error, per century, per gigabyte of memory."

So if you assume 1 error per GB per hour, yeah your time is way too slow. If you believe your error rate to be 1 error per GB per century, then you don't have anything to worry about (and could set your time much higher or even disable it). Don't you feel much better now that you have a concrete definitive answer to what your scrub rate should be? :)
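
To see what those per-bit rates mean for a given amount of RAM, here's a small Python sketch (function names made up, 1GB counted as 2^30 bytes, errors assumed to arrive at a constant average rate).  By this arithmetic the high end of the quoted range is indeed about one error per hour per GB, while the low end works out closer to one error per millennium per GB than per century.

```python
BITS_PER_GIB = 8 * 2**30
HOURS_PER_YEAR = 24 * 365.25


def errors_per_hour(rate_per_bit_hour, ram_gib):
    """Expected bit errors per hour for ram_gib of RAM at the given per-bit rate."""
    return rate_per_bit_hour * BITS_PER_GIB * ram_gib


def hours_between_errors(rate_per_bit_hour, ram_gib):
    """Average number of hours between bit errors."""
    return 1 / errors_per_hour(rate_per_bit_hour, ram_gib)


for rate in (1e-10, 1e-17):
    hours = hours_between_errors(rate, 8)   # 8 GB, as in the question above
    print(f"{rate:g} error/bit-h: one error every {hours:,.1f} hours "
          f"({hours / HOURS_PER_YEAR:,.2f} years)")
```

Compare the hours-between-errors figure with how long a full scrub pass takes to decide whether your interval is in the right ballpark.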

FWIW I personally have yet to see an ECC error on my setup that's been running 24x7 for a few months now (with 8GB of RAM instead of 4GB for about half of that time). According to this summary of the Google study on ECC RAM, they found that "error rates were motherboard, not DIMM type or vendor, dependent." So maybe I just got a good motherboard and I can go a century before I take an error?

If I go a year without an ECC error I might increase my scrub time (and do less scrubbing) or even disable scrubbing altogether. Since Windows Update reboots my Windows machine monthly my entire RAM is essentially getting scrubbed (i.e. wiped of all RAM errors) once a month anyway. Therefore I can probably avoid any double ECC errors simply by having a monthly reboot instead of scrubbing. YMMV.

Anonymous said...

Thank you for posting this. Your blog entry prompted me to more intensively find out how to enable ECC in my Gigabyte GA-M59SLI-S5 motherboard.

The GA-M59SLI-S5 BIOS also has an M.I.T. submenu, but the ECC settings are not found there. Instead, ECC is configured in a hidden submenu "Advanced Chipset Settings". The submenu can be unhidden by hitting Ctrl-F1 at the main BIOS screen.

The options in that submenu are similar to what you have described in your board, but with some differences in settings and impact on performance.

(Re: your comment in post #2 about Gigabyte trying to hide it -- they did!)

Jim said...

Hi Anonymous, long time no comment! :) I thought my blog would email me about new comments; I really should figure out how to enable that. Otherwise my current process of checking this once every 2-3 years does work.

Thanks for sharing the location of your Gigabyte ECC Easter Egg. Maybe that's why they hide it, to make it more fun for us to find it? I can't think of a better explanation. :)

Jim said...

This may be useful:

[How WHEA Performs PFA on ECC Memory (Windows Drivers)]
http://msdn.microsoft.com/en-us/library/ff559390(v=vs.85).aspx