Re: [PATCH v2 0/3] support for broken memory modules (BadRAM)

On Wed, 22 Jun 2011 11:00:34 -0700 Andrew Morton wrote:

> On Wed, 22 Jun 2011 13:18:51 +0200 Stefan Assmann <sassmann@xxxxxxxxx> wrote:
> 
> > Following the RFC for the BadRAM feature here's the updated version with
> > spelling fixes, thanks go to Randy Dunlap. Also the code is now less verbose,
> > as requested by Andi Kleen.
> > v2 with even more spelling fixes suggested by Randy.
> > Patches are against vanilla 2.6.39.
> > 
> > The idea is to allow the user to specify RAM addresses that shouldn't be
> > touched by the OS, because they are broken in some way. Not all machines have
> > hardware support for hwpoison, ECC RAM, etc., so here's a solution that uses
> > bitmasks to mask address patterns via the new "badram" kernel command line
> > parameter.
> > Memtest86 has an option to generate these patterns since v2.3 so the only thing
> > for the user to do should be:
> > - run Memtest86
> > - note down the pattern
> > - add badram=<pattern> to the kernel command line
> > 
> > The affected pages are then marked with the hwpoison flag and thus won't be
> > used by the memory management system.
> 
> The google kernel has a similar capability.  I asked Nancy to comment
> on these patches and she said:
> 
> : One, the bad addresses are passed via the kernel command line, which
> : has a limited length.  It's okay if the addresses can be fit into a
> : pattern, but that's not necessarily the case in the google kernel.  And
> : even with patterns, the limit on the command line length limits the
> : number of patterns that user can specify.  Instead we use lilo to pass
> : a file containing the bad pages in e820 format to the kernel.
> : 
> : Second, the BadRAM patch expands the address patterns from the command
> : line into individual entries in the kernel's e820 table.  The e820
> : table is a fixed buffer that supports a very small, hard coded number
> : of entries (128).  We require a much larger number of entries (on
> : the order of a few thousand), so much of the google kernel patch deals
> : with expanding the e820 table. Also, with the BadRAM patch, entries
> : that don't fit in the table are silently dropped and this isn't
> : appropriate for us.
> : 
> : Another caveat concerns mapping out too much bad memory in general.  If
> : too much memory is removed from low memory, a system may not boot.  We
> : solve this by generating good maps.  Our userspace tools do not map out
> : memory below a certain limit, and they verify against a system's iomap
> : that only addresses from memory are mapped out.
> 
> I have a couple of thoughts here:
> 
> - If this patchset is merged and a major user such as google is
>   unable to use it and has to continue to carry a separate patch then
>   that's a regrettable situation for the upstream kernel.
> 
> - Google's is, afaik, the largest use case we know of: zillions of
>   machines for a number of years.  And this real-world experience tells
>   us that the badram patchset has shortcomings.  Shortcomings which we
>   can expect other users to experience.
> 
> So.  What are your thoughts on these issues?


Good comments, so where is google's patch submittal?
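
For reference while discussing the table-size concern, here is a rough
userspace sketch (not code from either patch set) of how an address/mask
pattern expands into page ranges and how a fixed 128-entry table can
overflow.  It assumes the usual BadRAM convention that an address is bad
when it agrees with the fault address on every bit set in the mask, and
it assumes 4 KiB pages.  The helper names (page_is_bad, bad_ranges,
reserve) are made up for illustration.

```python
PAGE_SHIFT = 12      # assumption: 4 KiB pages
E820_MAX = 128       # table size quoted in the thread


def page_is_bad(pfn, fault, mask):
    """True if the page holds at least one address matching the pattern.

    Bits below PAGE_SHIFT vary freely inside a page, so only the
    page-level bits of the mask need to be compared.
    """
    page_mask = mask & ~((1 << PAGE_SHIFT) - 1)
    return ((pfn << PAGE_SHIFT) & page_mask) == (fault & page_mask)


def bad_ranges(fault, mask, start, end):
    """Coalesce matching pages in [start, end) into (first_pfn, last_pfn)."""
    ranges = []
    for pfn in range(start >> PAGE_SHIFT, end >> PAGE_SHIFT):
        if not page_is_bad(pfn, fault, mask):
            continue
        if ranges and ranges[-1][1] + 1 == pfn:
            ranges[-1] = (ranges[-1][0], pfn)   # extend the current run
        else:
            ranges.append((pfn, pfn))           # start a new run
    return ranges


def reserve(ranges, max_entries=E820_MAX):
    """Split into (kept, dropped) so overflow is visible to the caller."""
    return ranges[:max_entries], ranges[max_entries:]


if __name__ == "__main__":
    # Only bit 12 must match: every other page in the first 2 MiB is bad.
    kept, dropped = reserve(bad_ranges(0x1000, 0x1000, 0, 0x200000))
    print(len(kept), "ranges kept,", len(dropped), "would not fit")
```

A single sparse pattern like the one above already produces more
discontiguous ranges than a 128-entry table holds, which is why returning
the dropped entries (or failing loudly) seems preferable to discarding
them silently.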

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
