Using bad RAM

Update: I dreamed it and Rick van Rein implemented it. To be fair, he did not wait for me to get the idea. Still, it was quite a surprise to read an announcement about his module on Kernel Traffic just a few weeks after I wrote this page.

Have a look at his web site: Linux kernel support for broken RAM modules

At the office we had an NT machine on which some processes crashed inexplicably (well, it was worse than you would expect even from NT). I suspected a bad memory chip and indeed that was it: I ran memtest86 on it and found that a few memory addresses would return inconsistent results.

Of course NT never told us anything. Some applications would inexplicably crash and sometimes NT itself would crash. Of course neither of these is particularly unusual, so we kept using it for months. When we realized what was happening we exchanged the memory chip with that of an unused machine. Soon after, the filesystem of that other machine got corrupted. We were lucky!

So first, if you suspect you have a bad memory chip, use memtest86.

Then I would have expected the OS to warn me about that kind of thing. I'm not too surprised that NT did not warn me. But how would Linux fare? I did not get the chance (or bad luck) to find out. Maybe with regular RAM there is nothing that can be done. Linux would probably emit an 'oops', try to continue, and crash eventually. At least you know that Linux is not supposed to issue 'oopses', so you would get memtest86 and check it out.

But with ECC RAM and chipset support it should be possible to do better than that. First, I would expect the chipset to notify the OS that it had to error-correct some memory read. The OS would then issue a warning message that the memory may be rotting. Then, if error correction is not possible, the chipset could signal a memory read failure. This would cause some sort of interrupt routine to be activated, hopefully running from an address that still works, so that the OS gets a chance to warn the administrator and shut down. In fact the OS may even be able to take more sophisticated actions.
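Just to make the idea concrete, here is a minimal user-space sketch of the decision flow I have in mind. Everything in it is made up for illustration: a real implementation would get its report from a chipset-specific ECC interrupt and would use the kernel's own logging and shutdown machinery rather than printf() and exit().

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of the reporting flow described above, as plain user-space C.
     * In a real kernel the report would come from a chipset-specific ECC
     * interrupt and the actions would be kernel log messages and an
     * orderly shutdown; here they are just printf() and exit(). */

    enum ecc_event { ECC_CORRECTED, ECC_UNCORRECTABLE };

    struct ecc_report {
        enum ecc_event type;
        unsigned long  phys_addr;   /* failing physical address */
    };

    static void handle_ecc_report(const struct ecc_report *r)
    {
        if (r->type == ECC_CORRECTED) {
            /* Data was repaired: just warn that this module may be rotting. */
            printf("warning: corrected single-bit error at 0x%lx\n",
                   r->phys_addr);
        } else {
            /* Data is lost: warn the administrator and stop before the
             * corruption spreads to the filesystem. */
            printf("fatal: uncorrectable memory error at 0x%lx, shutting down\n",
                   r->phys_addr);
            exit(1);
        }
    }

    int main(void)
    {
        struct ecc_report corrected = { ECC_CORRECTED, 0x231e000 };
        struct ecc_report fatal = { ECC_UNCORRECTABLE, 0x2320000 };

        handle_ecc_report(&corrected);
        handle_ecc_report(&fatal);
        return 0;
    }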

Of course, in all cases every effort should be made to notify the administrator.

Then there's the question of what to do with this bad SIMM/DIMM. The standard answer is to throw away the whole thing. But I think the answer should be more nuanced.

The documentation of memtest86 suggests that it is quite frequent for just a few memory locations to go bad: some gates in the memory were on the edge and, due to aging, overheating or electrical overload, they just went over the edge while all the others are just fine. In such a case the defect is usually localized to the bits stored physically in a small area of just one of the memory chips. This would be a bit like an LCD screen that has some defects: two or three pixels that don't work. This does not mean that the rest works unreliably or that you should throw away the whole screen (actually the manufacturing plant will throw away an LCD screen with more than three defective pixels). This seems to be exactly what happened to our 128MB DIMM: according to memtest86 just a few memory locations, less than 64KB or 0.05%, do not work. It would be such a waste to throw it away! There must be a better way, but what?

Well, the simplest solution would be not to use the memory locations that don't work. This could be done by writing a Linux driver that reserves a list of specified memory ranges so that they are not used by Linux. The list of memory ranges would be provided as a kernel option, via LILO for instance, in the form 'badram=231e000-2326000'. The driver would claim these memory ranges as soon as possible so that no one else ever has a chance to use them.
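To give an idea, here is a rough sketch of what the boot-time part of such a driver could look like on a recent kernel. It is only my guess at an implementation: the option name, the use of early_param() and memblock_reserve(), and the fact that with this parsing the addresses need a 0x prefix are all choices of this sketch, not a description of Rick van Rein's actual BadRAM patch.

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/memblock.h>

    /* Boot-time sketch: reserve every range given as
     * badram=0x<start>-0x<end>[,...] so that the page allocator never
     * hands those pages out.  The option name and the use of
     * early_param()/memblock_reserve() are assumptions of this sketch. */
    static int __init badram_setup(char *str)
    {
        while (str && *str) {
            phys_addr_t start, end;

            start = memparse(str, &str);     /* e.g. 0x231e000 */
            if (*str != '-')
                return -EINVAL;
            end = memparse(str + 1, &str);   /* e.g. 0x2326000 */
            if (end <= start)
                return -EINVAL;

            /* Take the range out of the free memory pool for good. */
            memblock_reserve(start, end - start);

            if (*str == ',')
                str++;
            else
                break;
        }
        return 0;
    }
    early_param("badram", badram_setup);

Reserving the ranges this early means the page allocator never even sees them, so nothing else in the kernel needs to know about the bad spots.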

Still, once a memory chip has turned bad on you, even ever so slightly, it's hard to trust it anymore. I think I would continue using such a memory chip in a 'personal' workstation or on my home computer, but I sure would not recommend using it in anything that is 'mission critical' or that looks like a server.

So I devised other ways to use the memory even if you think any memory location might fail. They are more complex, less efficient, and require that you also have another memory SIMM/DIMM that does work fine.

You can use it as a read-only cache. Say you have two 128MB DIMMs in your computer. One works fine and the other returns random errors. You would configure Linux to only use the 128MB that works, and the other 128MB would be used by a special driver that would handle it as a disk cache. When Linux needs to free a clean page (i.e. a page which is already stored on disk), the driver would compute a CRC, store it in a safe location, and then copy the page to some place in the 'upper' memory (the one with the bad memory chip). Then when that page is needed again the driver would transfer the data back from 'upper' memory to regular memory. At the same time it would recompute the CRC and verify that it matches the one that was stored in regular memory. If all is fine we got our data back and we did not have to touch the disk. Otherwise we will have to reload the page from disk.
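Here is a user-space sketch of the store/load half of that scheme, assuming 4KB pages and a plain CRC-32. The flaky_cache structure and its functions are just illustrations; a real driver would hook into the kernel's page reclaim path instead of exposing calls like these.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* "upper" stands for the unreliable DIMM; the CRC table lives in
     * trusted memory so a bad read can always be detected. */

    /* Simple bit-by-bit CRC-32 (reflected, polynomial 0xEDB88320). */
    static uint32_t crc32_buf(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t crc = 0xFFFFFFFFu;

        while (len--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
        }
        return ~crc;
    }

    struct flaky_cache {
        uint8_t  *upper;    /* pages stored in the unreliable DIMM  */
        uint32_t *crc;      /* one CRC per slot, in trusted memory  */
        size_t    nslots;
    };

    /* Copy a clean page into the unreliable memory and remember its CRC. */
    static void cache_store(struct flaky_cache *c, size_t slot, const void *page)
    {
        memcpy(c->upper + slot * PAGE_SIZE, page, PAGE_SIZE);
        c->crc[slot] = crc32_buf(page, PAGE_SIZE);
    }

    /* Copy the page back.  Returns 0 on success, -1 if the data was
     * corrupted and must be reloaded from disk instead. */
    static int cache_load(struct flaky_cache *c, size_t slot, void *page)
    {
        memcpy(page, c->upper + slot * PAGE_SIZE, PAGE_SIZE);
        if (crc32_buf(page, PAGE_SIZE) != c->crc[slot])
            return -1;  /* bad RAM struck: fall back to the disk copy */
        return 0;
    }

    int main(void)
    {
        static uint8_t upper[4 * PAGE_SIZE];   /* stand-in for the bad DIMM */
        static uint32_t crcs[4];
        struct flaky_cache c = { upper, crcs, 4 };
        uint8_t page[PAGE_SIZE], back[PAGE_SIZE];

        memset(page, 0xAB, sizeof(page));
        cache_store(&c, 0, page);

        upper[123] ^= 0x01;                    /* simulate a flipped bit */

        printf("page 0: %s\n",
               cache_load(&c, 0, back) ? "corrupted, must reread from disk"
                                       : "recovered from RAM cache");
        return 0;
    }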

You could also do the same thing but with some kind of 'RAID 5' mechanism on top of it. This would allow you to also store data which is not on disk, i.e. dirty pages. The drawback is even more CPU and memory bandwidth usage.
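A sketch of the parity part, with the same caveats as above: one parity page per group of data pages kept in the bad DIMM, so that a single page caught corrupt by its CRC can be rebuilt from the parity and the surviving pages.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* "RAID 5 in RAM": keep one parity page (in good memory) that is the
     * XOR of a group of data pages stored in the unreliable DIMM.  If
     * exactly one page of the group is found corrupt, it can be rebuilt,
     * at the cost of extra CPU and memory bandwidth on every update. */

    /* Recompute the parity page over a group of data pages. */
    void parity_update(uint8_t *parity, uint8_t *const data[], size_t npages)
    {
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            uint8_t x = 0;
            for (size_t p = 0; p < npages; p++)
                x ^= data[p][i];
            parity[i] = x;
        }
    }

    /* Rebuild page 'lost' in place from the parity and the other pages. */
    void parity_rebuild(const uint8_t *parity, uint8_t *const data[],
                        size_t npages, size_t lost)
    {
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            uint8_t x = parity[i];
            for (size_t p = 0; p < npages; p++)
                if (p != lost)
                    x ^= data[p][i];
            data[lost][i] = x;
        }
    }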

In either case you have to determine what to do with the 'upper' memory locations that generated errors. If you suppose that these memory locations are just bad and will never work again, then the driver should mark them as such and make sure not to use them anymore. Then, after some use, the driver will have automagically mapped which memory areas are good and which are bad.
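The bookkeeping for that policy is simple enough; something like the bitmap below, kept in good memory, would do. The page size and module size are just the numbers from our DIMM.

    #include <stdbool.h>
    #include <stdint.h>

    /* One bit per page of the unreliable DIMM, set the first time the
     * page returns bad data.  Over time the bitmap becomes a map of the
     * good and bad areas of the module. */

    #define UPPER_PAGES  (128u * 1024 * 1024 / 4096)   /* 128MB of 4KB pages */

    static uint8_t bad_page_bitmap[UPPER_PAGES / 8];

    void mark_page_bad(unsigned page)
    {
        bad_page_bitmap[page / 8] |= 1u << (page % 8);
    }

    bool page_is_bad(unsigned page)
    {
        return bad_page_bitmap[page / 8] & (1u << (page % 8));
    }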

But if any memory location has an error rate of, say, 1e-7, then this is not the right thing to do, as you would quite rapidly mark all of the memory as unusable. So in that case the best approach is to keep reusing all of these memory locations, perhaps keeping track of each page's error rate so as to stop using pages whose error rate is too high.
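Here is a sketch of that alternative policy; the thresholds are completely made up and would have to be tuned.

    #include <stdbool.h>
    #include <stdint.h>

    /* Count how often each page of the unreliable DIMM is read back and
     * how often it returns bad data (e.g. a CRC mismatch), and retire it
     * only once its measured error rate is clearly too high. */

    struct page_stats {
        uint32_t uses;      /* times the page was read back          */
        uint32_t errors;    /* times the read data failed its check  */
        bool     retired;   /* no longer used for caching            */
    };

    #define MIN_SAMPLES   1000     /* don't judge a page too early      */
    #define MAX_ERR_RATE  0.001    /* retire above one error per 1000   */

    /* Record the outcome of one read-back and decide whether to keep
     * using the page.  Returns true if the page is still usable. */
    bool page_record_access(struct page_stats *s, bool had_error)
    {
        if (s->retired)
            return false;

        s->uses++;
        if (had_error)
            s->errors++;

        if (s->uses >= MIN_SAMPLES &&
            (double)s->errors / s->uses > MAX_ERR_RATE)
            s->retired = true;     /* this page really is going bad     */

        return !s->retired;
    }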

So is this feasible? I think so. The 'mark the bad spots' scheme should be rather simple to implement. The last two schemes are more complex, but if software RAID 5 performs adequately then they too should work fine. In fact there is a patch, SLRAM, on Linux-MM, which is designed to handle non-cacheable memory as a swap area. That's kind of similar.

Is anyone interested in implementing any of this?

fgouget@free.fr