On Tuesday, 10 January 2012 20:24:58 Dave Anderson wrote:
> ----- Original Message -----
> > Hi folks,
> >
> > I've just discovered that the crash utility fails to initialize the vm
> > subsystem properly on our latest SLES 32-bit kernels. It turns out that
> > our kernels are compiled with CONFIG_DISCONTIGMEM=y, which causes pgdat
> > structs to be allocated by the remap allocator (cf. arch/x86/mm/numa_32.c
> > and also the code in setup_node_data).
> >
> > If you don't know what the remap allocator is (like I didn't before I
> > hit the bug), it's a very special early-boot allocator which remaps
> > physical pages from low memory to high memory, giving them virtual
> > addresses from the identity mapping. Looks a bit like this:
> >
> >                           physical addr
> >                           +------------+
> >                     +---> |  KVA RAM   |
> >                     |     +------------+
> >                     |     \/\/\/\/\/\/\/
> >                     |     /\/\/\/\/\/\/\
> >    virtual addr     |     |  highmem   |
> >   +------------+    |     |------------|
> >   |            | ---|---> |            |
> >   +------------+    |     +------------+
> >   |  remap va  | ---+     |   KVA PG   | (unused)
> >   +------------+          +------------+
> >   |            | -------> | RAM bottom |
> >   +------------+          +------------+
> >
> > This breaks a very basic assumption that crash makes about low-memory
> > virtual addresses.
>
> Hmmm, yeah, I am also unaware of this, and I'm not entirely clear based
> upon your explanation. What do "KVA PG" and "KVA RAM" mean exactly? And
> do just the pgdat structures (which I know can be huge) get moved from
> low to high physical memory (per-node perhaps), and then remapped with
> mapped virtual addresses?

Well, the concept dates back to Martin Bligh's patch in 2002 which added
this for NUMA-Q. My understanding is that "KVA PG" refers to the kernel
virtual addresses used to access the pgdat array, as well as to the
physical memory that would correspond to these virtual addresses if they
were identity-mapped. This physical memory is then inaccessible. "KVA RAM",
on the other hand, is where the pgdat structures are actually stored.

Please note that there is no "moving" of the structures, because the
remapping happens when the memory nodes are initialized, i.e. before
anything accesses them.

Regarding your second question, anything can theoretically call
alloc_remap() to allocate memory from this region, but nothing does, and
looking at init_alloc_remap(), the size of the pool is always calculated
as the size of the pgdat array plus struct pglist_data, rounded up to a
multiple of 2MB (so that large pages can be used), so there's really only
room for the pgdat.

> Anyway, I trust you know what you're doing...

Thank you for the trust.

> > The attached patch fixes the issue for me, but may not be the cleanest
> > method to handle these mappings.
>
> Anyway, what I can't wrap my head around is that the initialization
> sequence is being done by the first call to x86_kvtop_PAE(), which calls
> x86_kvtop_remap(), which calls initialize_remap(), which calls readmem(),
> which calls x86_kvtop_PAE(), starting the whole thing over again. How
> does that recursion work? Would it be possible to call initialize_remap()
> earlier on instead of doing it upon the first kvtop() call?

Agreed. My thinking was that each node has its own remap region, so I
wanted to know the number of nodes first. Since I didn't want to duplicate
the heuristics used to determine the number of nodes, I couldn't initialize
before vm_init(). Then again, the remap mapping is accessed before
vm_init() finishes.
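For reference, once the per-node ranges are known, the lookup that
x86_kvtop_remap() has to perform boils down to something like the
stand-alone sketch below. This is only an illustration, not the attached
patch: in the real code the node_remap_* values are read from the dump,
the fallback is the normal PAE page-table walk rather than a bare identity
mapping, and MAX_NUMNODES, PAGE_OFFSET and the numbers in main() are made
up.

/*
 * Stand-alone illustration of the remap lookup. The node_remap_* arrays
 * mirror the static arrays in arch/x86/mm/numa_32.c; everything else is
 * simplified for the example.
 */
#include <stdio.h>

#define MAX_NUMNODES	8		/* illustration only */
#define PAGE_SHIFT	12
#define PAGE_OFFSET	0xc0000000UL	/* typical 32-bit lowmem base */

static unsigned long node_remap_start_vaddr[MAX_NUMNODES];
static unsigned long node_remap_end_vaddr[MAX_NUMNODES];
static unsigned long node_remap_start_pfn[MAX_NUMNODES];
static int numnodes;

/* Translate a kernel virtual address, honouring the remapped regions. */
static unsigned long remap_kvtop(unsigned long vaddr)
{
	int nid;

	for (nid = 0; nid < numnodes; nid++) {
		if (vaddr >= node_remap_start_vaddr[nid] &&
		    vaddr < node_remap_end_vaddr[nid])
			/* Remapped: actually backed by "KVA RAM" in highmem. */
			return (node_remap_start_pfn[nid] << PAGE_SHIFT) +
				(vaddr - node_remap_start_vaddr[nid]);
	}
	/* Not remapped: ordinary identity-mapped lowmem. */
	return vaddr - PAGE_OFFSET;
}

int main(void)
{
	/* Made-up single-node example: a 4MB remap window at 0xf7800000
	   backed by physical memory starting at pfn 0x34a00. */
	numnodes = 1;
	node_remap_start_vaddr[0] = 0xf7800000UL;
	node_remap_end_vaddr[0]   = 0xf7c00000UL;
	node_remap_start_pfn[0]   = 0x34a00UL;

	printf("remapped: %#lx -> %#lx\n", 0xf7800010UL, remap_kvtop(0xf7800010UL));
	printf("identity: %#lx -> %#lx\n", 0xc1000000UL, remap_kvtop(0xc1000000UL));
	return 0;
}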
I can see now that this is unnecessarily complicated, because the
node_remap_* variables are static arrays of MAX_NUMNODES elements, so I can
get their size from the debuginfo at POST_GDB init and initialize a
machine-specific data structure with it. I'll post another patch tomorrow.

Thanks for the hint!

Petr Tesarik
SUSE Linux

> > Ken'ichi Ohmichi, please note that makedumpfile is also affected by
> > this deficiency. On my test system, it will fail to produce any output
> > if I set the dump level to anything greater than zero:
> >
> > makedumpfile -c -d 31 -x vmlinux-3.0.13-0.5-pae.debug vmcore kdump.31
> > readmem: Can't convert a physical address(34a012b4) to offset.
> > readmem: type_addr: 0, addr:f4a012b4, size:4
> > get_mm_discontigmem: Can't get node_start_pfn.
> >
> > makedumpfile Failed.
> >
> > However, fixing this for makedumpfile is harder, and it will most
> > likely require a few more lines in VMCOREINFO, because debug symbols
> > may not be available at dump time, and I can't see any alternative
> > method to locate the remapped regions.
> >
> > Regards,
> > Petr Tesarik
> > SUSE Linux

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility