On 2015/5/9 0:44, Tony Luck wrote:
> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
>
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases we are unlikely
> to ever be able to recover.
>
> Enter memory mirroring. Actually 3rd generation of memory mirroring.
>
> Gen1: All memory is mirrored
>	Pro: No s/w enabling - h/w just gets good data from other side of the mirror
>	Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory behind some memory controllers
>	Pro: Keep more of the capacity
>	Con: Nightmare to enable. Have to choose between allocating from
>	     mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory controller
>	Pro: Can tune the amount of mirror and keep NUMA performance
>	Con: I have to write memory management code to implement
>
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>    mirrored memory from mm/memblock.c and only give it out to select kernel
>    allocations (this is still being scoped because page_alloc.c is scary).
>

Hi Tony,

In part 2, does this mean that memory allocated by the kernel should use
mirrored memory?

I have heard of this feature (address range mirroring) before, and I changed
some code to test it (implementing memory allocation from specific physical
areas).

In my opinion, adding a new zone (ZONE_MIRROR) to hold the mirrored memory is
not a good idea.
If there are XX discontiguous mirrored areas in one NUMA node, there would
have to be XX ZONE_MIRROR zones in one pgdat, which is impossible, right?

I think adding a new migrate type (MIGRATE_MIRROR) would be better. The
following output is from my modified kernel.

[root@localhost ~]# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      3
Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      1      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable     14      7      6      1      3      0      1      0      0      0      0
Node    0, zone    DMA32, type  Reclaimable     15      2      2      1      1      2      1      1      0      0      0
Node    0, zone    DMA32, type      Movable      3     24     52     58     31      2      1      1      1      3    231
Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable     80     12      6      7      3      1     67     58     23     11      0
Node    0, zone   Normal, type  Reclaimable      6      6      8     11      5      3      0      1      0      0      0
Node    0, zone   Normal, type      Movable      6    198    618    675    363     13      4      3      0      2   4074
Node    0, zone   Normal, type       Mirror      0      0      0      0      0      0      0      0      0      0   1024
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve          CMA      Isolate
Node 0, zone      DMA             1            0            6            0            1            0            0
Node 0, zone    DMA32             8           32          975            0            1            0            0
Node 0, zone   Normal           216          334        12760         2048            2            0            0

Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    1, zone   Normal, type    Unmovable     18      2     19      3     21     28     13      0      1      1      0
Node    1, zone   Normal, type  Reclaimable      0      1      1      1      0      0      1      0      0      1      0
Node    1, zone   Normal, type      Movable      6     13      9      3      0      4      5      0      1      0   6970
Node    1, zone   Normal, type       Mirror      0      0      0      0      0      0      0      0      0      0   1024
Node    1, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    1, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    1, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve          CMA      Isolate
Node 1, zone   Normal           112            4        14218         2048            2            0            0

I also added a new flag (GFP_MIRROR), so mirrored memory can be used from
both kernel space and user space. If there is no mirrored memory available,
the allocation falls back to the other memory types.

1) kernel-space (pcp, page buddy, slab/slub ...):
	-> use mirrored memory (e.g. /proc/sys/vm/mirrorable)
	-> __alloc_pages_nodemask()
	-> gfpflags_to_migratetype()
	-> use MIGRATE_MIRROR list
2) user-space (syscall, madvise, mmap ...):
	-> add VM_MIRROR flag in the vma
	-> add GFP_MIRROR when page fault in the vma
	-> __alloc_pages_nodemask()
	-> use MIGRATE_MIRROR list

Thanks,
Xishi Qiu

> Tony Luck (3):
>   mm/memblock: Add extra "flags" to memblock to allow selection of
>     memory based on attribute
>   mm/memblock: Allocate boot time data structures from mirrored memory
>   x86, mirror: x86 enabling - find mirrored memory ranges
>
>  arch/s390/kernel/crash_dump.c |   5 +-
>  arch/sparc/mm/init_64.c       |   6 ++-
>  arch/x86/kernel/check.c       |   3 +-
>  arch/x86/kernel/e820.c        |   3 +-
>  arch/x86/kernel/setup.c       |   3 ++
>  arch/x86/mm/init_32.c         |   2 +-
>  arch/x86/platform/efi/efi.c   |  21 ++++++++
>  include/linux/efi.h           |   3 ++
>  include/linux/memblock.h      |  49 +++++++++++------
>  mm/cma.c                      |   6 ++-
>  mm/memblock.c                 | 123 +++++++++++++++++++++++++++++++++---------
>  mm/memtest.c                  |   3 +-
>  mm/nobootmem.c                |  14 ++++-
>  13 files changed, 188 insertions(+), 53 deletions(-)
>
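
For reference, here is a minimal user-space sketch of the GFP_MIRROR idea described in the reply above: a flag selects the MIGRATE_MIRROR free list, and the request falls back to the other types when no mirrored page is free. This is not the modified kernel code. The struct, the flag bit values, the helper names and the fallback order are all assumptions made for illustration, and so is the rule that non-mirror requests never take mirrored pages (the mail does not say either way).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model only; names mirror the mail, values are made up. */
enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_MOVABLE,
	MIGRATE_MIRROR,
	MIGRATE_TYPES
};

#define GFP_MOVABLE     0x1u	/* illustrative bits, not the kernel's */
#define GFP_RECLAIMABLE 0x2u
#define GFP_MIRROR      0x4u

/* Per-type free page counters standing in for the buddy free lists. */
struct zone_model {
	unsigned long nr_free[MIGRATE_TYPES];
};

/* Map request flags to a preferred migrate type; GFP_MIRROR wins. */
static enum migratetype gfp_to_migratetype(unsigned int gfp)
{
	if (gfp & GFP_MIRROR)
		return MIGRATE_MIRROR;
	if (gfp & GFP_MOVABLE)
		return MIGRATE_MOVABLE;
	if (gfp & GFP_RECLAIMABLE)
		return MIGRATE_RECLAIMABLE;
	return MIGRATE_UNMOVABLE;
}

/*
 * Take one page from the zone model. Returns the migrate type the page
 * came from, or -1 if nothing is free.
 */
static int alloc_page_model(struct zone_model *z, unsigned int gfp)
{
	enum migratetype want = gfp_to_migratetype(gfp);
	int mt;

	assert(z != NULL);

	if (z->nr_free[want]) {
		z->nr_free[want]--;
		return (int)want;
	}

	/*
	 * Fallback: scan the non-mirror types in enum order. Skipping
	 * MIGRATE_MIRROR here keeps mirrored pages for requests that
	 * asked for them (an assumption of this sketch).
	 */
	for (mt = 0; mt < MIGRATE_TYPES; mt++) {
		if (mt == MIGRATE_MIRROR || mt == (int)want)
			continue;
		if (z->nr_free[mt]) {
			z->nr_free[mt]--;
			return mt;
		}
	}
	return -1;
}
```

With one mirrored and one movable page free, two GFP_MIRROR requests in a row would get the mirrored page first and the movable page second, which is the fallback behaviour the mail describes.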