----- Original Message -----
> Hi Dave,
>
> On 2/21/2018 4:14 PM, Dave Anderson wrote:
> >
> > ----- Original Message -----
> >> Hi Dave,
> >>
> >> Thank you so much for your review!
> >>
> >> On 2/21/2018 11:41 AM, Dave Anderson wrote:
> >>>
> >>> Hi Kazuhito,
> >>>
> >>> Just as a follow-up review of this part of your original patch:
> >>>
> >>> +#ifdef VMEMMAP_VADDR
> >>> +        nr = nr_mem_sections;
> >>> +        if (machdep->flags & VMEMMAP)
> >>> +                nr = pfn_to_section_nr((addr - VMEMMAP_VADDR) / SIZE(page));
> >>> +        else if (readmem(addr + OFFSET(page_flags), KVADDR, &flags,
> >>> +                sizeof(ulong), "page.flags", RETURN_ON_ERROR|QUIET))
> >>> +                nr = (flags >> (SIZE(page_flags)*8 - SECTIONS_SHIFT())
> >>> +                        & ((1UL << SECTIONS_SHIFT()) - 1));
> >>> +
> >>> +        if (nr < nr_mem_sections) {
> >>> +#else
> >>>          for (nr = 0; nr < nr_mem_sections ; nr++) {
> >>> +#endif
> >>>
> >>> One of my original concerns was associated with the backwards-compatibility
> >>> of the non-VMEMMAP page->flags usage, primarily because it has changed
> >>> over the years.  Perhaps the SECTIONS_SHIFT part has remained the same,
> >>> but depending upon its future stability in this function still worries me.
> >>
> >> Yes, I understand the concern.
> >>
> >> The SECTIONS_SHIFT part is the same as the function page_to_section() in
> >> include/linux/mm.h.  I thought that if the values related to SECTIONS_SHIFT
> >> in the kernel change, the crash utility will have to follow them regardless
> >> of this patch, because they are used commonly in crash.  But possible
> >> changes may not be limited to such values.
> >
> > It's true that crash should follow the upstream values, but in the past
> > there have been periods of time when the MAX_PHYSMEM_BITS and
> > SECTION_SIZE_BITS for different architectures changed upstream, but were
> > not immediately updated in the crash utility.  And that was because crash
> > still worked OK, because most systems did not have large enough memory for
> > the changes to cause things to fail.
>
> Thank you, now I understand it more precisely.
>
> [...]
>
> > So this goes to the question as to whether is_page_ptr() should return TRUE
> > if the incoming page address is accessible(), but it references a physical
> > address that does not exist.  With the current code, it verifies that it's
> > a page address, and that the page address points to an actual physical
> > memory page.  I suggested using accessible() on the page structure address,
> > but that would not necessarily be accurate, because theoretically a
> > vmemmap'd/instantiated page of page structures could contain page structure
> > addresses that do not reference physical memory.  (e.g., if a hole started
> > at a page address that's in the middle of a vmemmap'd page of page
> > structures)
>
> As to the last example (IIUC), I had confirmed that accessible() returned
> FALSE if a page address was in a hole like the one below on x86_64/RHEL7,
> so I was writing the prototype patch.
>
> crash> kmem -n
> ...
>  NR      SECTION        CODED_MEM_MAP        MEM_MAP       PFN
> ...
>  23  ffff88043ffd82e0  ffffea0000000000  ffffea0002e00000  753664   ## There is a
>  32  ffff88043ffd8400  ffffea0000000000  ffffea0004000000  1048576  ## hole here.
> ...
> crash> kmem ffffea0002ffffc0             ## The last page struct of NR=23
> DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1
> DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1   ## in is_page_ptr()
> DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1
> DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1
>       PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
> ffffea0002ffffc0    bffff000           0        0    1 1fffff00000000
> crash> kmem ffffea0003000000             ## A page address in a hole.
> kmem: WARNING: cannot make virtual-to-physical translation: ffffea0003000000
> DEBUG: addr=0xffffea0003000000 nr=24 accessible=0
> DEBUG: addr=0xffffea0003000000 nr=24 accessible=0   ## returned 0
> DEBUG: addr=0xffffea0003000000 nr=24 accessible=0
> ffffea0003000000: kernel virtual address not found in mem map
>
> I'm not sure whether there are cases other than the above in which a page
> address does not reference physical memory.  But since accessible() is
> essentially readmem(), which actually reads a dump file, it could also
> quietly return FALSE due to some error, and this could lead to a wrong
> judgement.
>
> And I had thought that if accessible() was valid for the page validity
> test, there was no need to calculate its section.  So could we remove
> the accessible() call and continue to use the valid_section_nr() section
> for the test, like this?:
> ---
> diff --git a/defs.h b/defs.h
> index aa17792..ab98cb7 100644
> --- a/defs.h
> +++ b/defs.h
> @@ -3335,6 +3335,9 @@ struct arm64_stackframe {
>  #define VTOP(X)               x86_64_VTOP((ulong)(X))
>  #define IS_VMALLOC_ADDR(X)    x86_64_IS_VMALLOC_ADDR((ulong)(X))
>
> +#define IS_VMEMMAP_PAGE_ADDR(X)  x86_64_IS_VMEMMAP_PAGE_ADDR((ulong)(X))
> +#define VMEMMAP_PAGE_TO_PFN(X)   (((ulong)(X) - VMEMMAP_VADDR) / SIZE(page))
> +
>  /*
>   *  the default page table level for x86_64:
>   *    4 level page tables
> @@ -5646,6 +5649,7 @@ void x86_64_dump_machdep_table(ulong);
>  ulong x86_64_PTOV(ulong);
>  ulong x86_64_VTOP(ulong);
>  int x86_64_IS_VMALLOC_ADDR(ulong);
> +int x86_64_IS_VMEMMAP_PAGE_ADDR(ulong);
>  void x86_64_display_idt_table(void);
>  #define display_idt_table() x86_64_display_idt_table()
>  long x86_64_exception_frame(ulong, ulong, char *, struct bt_info *, FILE *);
> diff --git a/memory.c b/memory.c
> index 0df8ecc..928c3c2 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -13350,6 +13350,30 @@ is_page_ptr(ulong addr, physaddr_t *phys)
>          physaddr_t section_paddr;
>
>          if (IS_SPARSEMEM()) {
> +#ifdef IS_VMEMMAP_PAGE_ADDR
> +                if (machdep->flags & VMEMMAP) {
> +                        if (IS_VMEMMAP_PAGE_ADDR(addr)) {
> +                                nr = pfn_to_section_nr(VMEMMAP_PAGE_TO_PFN(addr));
> +                                if ((sec_addr = valid_section_nr(nr))) {
> +                                        coded_mem_map = section_mem_map_addr(sec_addr);
> +                                        mem_map = sparse_decode_mem_map(coded_mem_map, nr);
> +                                        end_mem_map = mem_map + (PAGES_PER_SECTION() * SIZE(page));
> +
> +                                        if ((addr >= mem_map) && (addr < end_mem_map)) {
> +                                                if ((addr - mem_map) % SIZE(page))
> +                                                        return FALSE;
> +                                                if (phys) {
> +                                                        section_paddr = PTOB(section_nr_to_pfn(nr));
> +                                                        pgnum = (addr - mem_map) / SIZE(page);
> +                                                        *phys = section_paddr + ((physaddr_t)pgnum * PAGESIZE());
> +                                                }
> +                                                return TRUE;
> +                                        }
> +                                }
> +                        }
> +                        return FALSE;
> +                }
> +#endif
>                  nr_mem_sections = NR_MEM_SECTIONS();
>                  for (nr = 0; nr < nr_mem_sections ; nr++) {
>                          if ((sec_addr = valid_section_nr(nr))) {
> diff --git a/x86_64.c b/x86_64.c
> index 7449571..7240c5d 100644
> --- a/x86_64.c
> +++ b/x86_64.c
> @@ -1570,6 +1570,14 @@ x86_64_IS_VMALLOC_ADDR(ulong vaddr)
>                  (vaddr >= VSYSCALL_START && vaddr < VSYSCALL_END));
>  }
>
> +int
> +x86_64_IS_VMEMMAP_PAGE_ADDR(ulong vaddr)
> +{
> +        return ((machdep->flags & VMEMMAP) &&
> +                (vaddr >= VMEMMAP_VADDR && vaddr <= VMEMMAP_END) &&
> +                !((vaddr - VMEMMAP_VADDR) % SIZE(page)));
> +}
> +
>  static int
>  x86_64_is_module_addr(ulong vaddr)
>  {
> ---
>
> > So if we don't want to change the functionality of is_page_ptr(), then maybe
> > the binary search would be a suitable compromise for both accuracy and speed
> > on extremely large systems?
>
> I have not considered it enough yet, but if all ranges of mem_sections are
> ascending by section number, and vmemmap holes like the above are handled
> correctly, it might be good.  (I'm guessing that the binary search might
> need an auxiliary array or something due to the vmemmap holes..)
>
> Thanks,
> Kazuhito Hagio

This "#ifdef IS_VMEMMAP_PAGE_ADDR" patch is getting really ugly.

I've been playing around with this, and this is my latest counter-proposal.
First, the mem_section numbers are ascending.  They may not necessarily
start with 0, and there may be holes, but they are ascending.  That being
the case, there is no need for is_page_ptr() to walk through
NR_MEM_SECTIONS() worth of entries, because there will be an ending number
that's typically much smaller.  Even on a 256GB dumpfile I have on hand,
which has an NR_MEM_SECTIONS() value of 524288, the largest valid section
number is 2055.  (What are the smallest and largest numbers that you see
on whatever large-memory system you are testing with?)

In any case, let's store the largest section number during initialization
in the vm_table, and use it as a delimiter in is_page_ptr():

diff --git a/defs.h b/defs.h
index aa17792..8768fd5 100644
--- a/defs.h
+++ b/defs.h
@@ -2369,6 +2369,7 @@ struct vm_table {                /* kernel VM-related data */
                 ulong mask;
                 char *name;
         } *pageflags_data;
+        ulong max_mem_section_nr;
 };

 #define NODES                       (0x1)
diff --git a/memory.c b/memory.c
index 0df8ecc..6770937 100644
--- a/memory.c
+++ b/memory.c
@@ -255,7 +255,7 @@ static void PG_reserved_flag_init(void);
 static void PG_slab_flag_init(void);
 static ulong nr_blockdev_pages(void);
 void sparse_mem_init(void);
-void dump_mem_sections(void);
+void dump_mem_sections(int);
 void list_mem_sections(void);
 ulong sparse_decode_mem_map(ulong, ulong);
 char *read_mem_section(ulong);
@@ -13350,7 +13350,7 @@ is_page_ptr(ulong addr, physaddr_t *phys)
         physaddr_t section_paddr;

         if (IS_SPARSEMEM()) {
-                nr_mem_sections = NR_MEM_SECTIONS();
+                nr_mem_sections = vt->max_mem_section_nr+1;
                 for (nr = 0; nr < nr_mem_sections ; nr++) {
                         if ((sec_addr = valid_section_nr(nr))) {
                                 coded_mem_map = section_mem_map_addr(sec_addr);
@@ -13668,6 +13668,7 @@ dump_vm_table(int verbose)
         fprintf(fp, "   swap_info_struct: %lx\n", (ulong)vt->swap_info_struct);
         fprintf(fp, "            mem_sec: %lx\n", (ulong)vt->mem_sec);
         fprintf(fp, "        mem_section: %lx\n", (ulong)vt->mem_section);
+        fprintf(fp, " max_mem_section_nr: %ld\n", vt->max_mem_section_nr);
         fprintf(fp, "       ZONE_HIGHMEM: %d\n", vt->ZONE_HIGHMEM);
         fprintf(fp, "node_online_map_len: %d\n", vt->node_online_map_len);
         if (vt->node_online_map_len) {
@@ -16295,8 +16296,8 @@ dump_memory_nodes(int initialize)
                 vt->numnodes = n;
         }

-        if (!initialize && IS_SPARSEMEM())
-                dump_mem_sections();
+        if (IS_SPARSEMEM())
+                dump_mem_sections(initialize);
 }

 /*
@@ -17128,9 +17129,9 @@ pfn_to_map(ulong pfn)
 }

 void
-dump_mem_sections(void)
+dump_mem_sections(int initialize)
 {
-        ulong nr,addr;
+        ulong nr, max, addr;
         ulong nr_mem_sections;
         ulong coded_mem_map, mem_map, pfn;
         char buf1[BUFSIZE];
@@ -17140,6 +17141,15 @@ dump_mem_sections(void)

         nr_mem_sections = NR_MEM_SECTIONS();

+        if (initialize) {
+                for (nr = max = 0; nr < nr_mem_sections ; nr++) {
+                        if (valid_section_nr(nr))
+                                max = nr;
+                }
+                vt->max_mem_section_nr = max;
+                return;
+        }
+
         fprintf(fp, "\n");
         pad_line(fp, BITS32() ? 59 : 67, '-');
         fprintf(fp, "\n\nNR  %s  %s  %s  PFN\n",

Now, with respect to the architecture-specific, VMEMMAP-only part that is
of most interest to you, let's do it with an architecture-specific callback.
You can post it for x86_64, and the other architecture maintainers can
write their own versions.

For example, add a new callback function pointer to the machdep_table
structure, i.e., like this:

struct machdep_table {
        ulong flags;
        ulong kvbase;
        ...
        void (*get_irq_affinity)(int);
        void (*show_interrupts)(int, ulong *);
+       int (*is_page_ptr)(ulong, physaddr_t *);
};

Write the x86_64_is_page_ptr() function so that it works for VMEMMAP
kernels, and returns FALSE otherwise.  And add the call to the top of
is_page_ptr(), like this:

int
is_page_ptr(ulong addr, physaddr_t *phys)
{
        int n;
        ulong ppstart, ppend;
        struct node_table *nt;
        ulong pgnum, node_size;
        ulong nr, sec_addr;
        ulong nr_mem_sections;
        ulong coded_mem_map, mem_map, end_mem_map;
        physaddr_t section_paddr;

+       if (machdep->is_page_ptr(addr, phys))
+               return TRUE;

        if (IS_SPARSEMEM()) {
        ...
To avoid having to check whether the machdep->is_page_ptr function exists,
write a generic_is_page_ptr() function that just returns FALSE, and
initialize machdep->is_page_ptr to generic_is_page_ptr in the
setup_environment() function.  Later on, individual architectures can
overwrite it when machdep_init(SETUP_ENV) is called.

How's that sound?

Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility