Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2015/5/9 0:44, Tony Luck wrote:

> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
> 
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases were are unlikely
> to ever be able to recover.
> 
> Enter memory mirroring. Actually 3rd generation of memory mirroing.
> 
> Gen1: All memory is mirrored
> 	Pro: No s/w enabling - h/w just gets good data from other side of the mirror
> 	Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory begind some memory controllers
> 	Pro: Keep more of the capacity
> 	Con: Nightmare to enable. Have to choose between allocating from
> 	     mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory controller
> 	Pro: Can tune the amount of mirror and keep NUMA performance
> 	Con: I have to write memory management code to implement
> 
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>    mirrored memory from mm/memblock.c and only give it out to select kernel
>    allocations (this is still being scoped because page_alloc.c is scary).
> 

Hi Tony,

In part2, does it means the memory allocated from kernel should use mirrored memory?

I have heard of this feature(address range mirroring) before, and I changed some
code to test it(implement memory allocations in specific physical areas).

In my opinion, add a new zone(ZONE_MIRROR) to fill the mirrored memory is not a good
idea. If there are XX discontiguous mirrored areas in one numa node, there should be
XX ZONE_MIRROR zones in one pgdat, it is impossible, right?

I think add a new migrate type(MIGRATE_MIRROR) will be better, the following print
is from my changed kernel. 

[root@localhost ~]# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      3
Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      1      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable     14      7      6      1      3      0      1      0      0      0      0
Node    0, zone    DMA32, type  Reclaimable     15      2      2      1      1      2      1      1      0      0      0
Node    0, zone    DMA32, type      Movable      3     24     52     58     31      2      1      1      1      3    231
Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable     80     12      6      7      3      1     67     58     23     11      0
Node    0, zone   Normal, type  Reclaimable      6      6      8     11      5      3      0      1      0      0      0
Node    0, zone   Normal, type      Movable      6    198    618    675    363     13      4      3      0      2   4074
Node    0, zone   Normal, type       Mirror      0      0      0      0      0      0      0      0      0      0   1024
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve          CMA      Isolate
Node 0, zone      DMA            1            0            6            0            1            0            0
Node 0, zone    DMA32            8           32          975            0            1            0            0
Node 0, zone   Normal          216          334        12760         2048            2            0            0
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    1, zone   Normal, type    Unmovable     18      2     19      3     21     28     13      0      1      1      0
Node    1, zone   Normal, type  Reclaimable      0      1      1      1      0      0      1      0      0      1      0
Node    1, zone   Normal, type      Movable      6     13      9      3      0      4      5      0      1      0   6970
Node    1, zone   Normal, type       Mirror      0      0      0      0      0      0      0      0      0      0   1024
Node    1, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    1, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    1, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve          CMA      Isolate
Node 1, zone   Normal          112            4        14218         2048            2            0            0


Also I add a new flag(GFP_MIRROR), then we can use the mirrored form both
kernel-space and user-space. If there is no mirrored memory, we will allocate
other types memory.

1) kernel-space(pcp, page buddy, slab/slub ...):
	-> use mirrored memory(e.g. /proc/sys/vm/mirrorable)
		-> __alloc_pages_nodemask()
			->gfpflags_to_migratetype()
				-> use MIGRATE_MIRROR list
2) user-space(syscall, madvise, mmap ...):
	-> add VM_MIRROR flag in the vma
		-> add GFP_MIRROR when page fault in the vma
			-> __alloc_pages_nodemask()
				-> use MIGRATE_MIRROR list

Thanks,
Xishi Qiu

> Tony Luck (3):
>   mm/memblock: Add extra "flags" to memblock to allow selection of
>     memory based on attribute
>   mm/memblock: Allocate boot time data structures from mirrored memory
>   x86, mirror: x86 enabling - find mirrored memory ranges
> 
>  arch/s390/kernel/crash_dump.c |   5 +-
>  arch/sparc/mm/init_64.c       |   6 ++-
>  arch/x86/kernel/check.c       |   3 +-
>  arch/x86/kernel/e820.c        |   3 +-
>  arch/x86/kernel/setup.c       |   3 ++
>  arch/x86/mm/init_32.c         |   2 +-
>  arch/x86/platform/efi/efi.c   |  21 ++++++++
>  include/linux/efi.h           |   3 ++
>  include/linux/memblock.h      |  49 +++++++++++------
>  mm/cma.c                      |   6 ++-
>  mm/memblock.c                 | 123 +++++++++++++++++++++++++++++++++---------
>  mm/memtest.c                  |   3 +-
>  mm/nobootmem.c                |  14 ++++-
>  13 files changed, 188 insertions(+), 53 deletions(-)
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]