On Sat, Nov 16, 2024 at 6:59 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> Page Detective is a kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
>
> It operates through the Linux debugfs interface, providing access
> to both virtual and physical address inquiries. The output, presented
> via kernel log messages (accessible with dmesg), will help
> administrators and developers understand how specific pages are
> utilized by the system.
>
> This tool can be used to investigate various memory-related issues,
> such as checksum failures during live migration, filesystem journal
> failures, general segfaults, or other corruptions.

[...]

> +/*
> + * Walk kernel page table, and print all mappings to this pfn, return 1 if
> + * pfn is mapped in direct map, return 0 if not mapped in direct map, and
> + * return -1 if operation canceled by user.
> + */
> +static int page_detective_kernel_map_info(unsigned long pfn,
> +                                          unsigned long direct_map_addr)
> +{
> +        struct pd_private_kernel pr = {0};
> +        unsigned long s, e;
> +
> +        pr.direct_map_addr = direct_map_addr;
> +        pr.pfn = pfn;
> +
> +        for (s = PAGE_OFFSET; s != ~0ul; ) {
> +                e = s + PD_WALK_MAX_RANGE;
> +                if (e < s)
> +                        e = ~0ul;
> +
> +                if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) {

I think which parts of the kernel virtual address range you can safely
pagewalk is somewhat architecture-specific; for example, X86 can run
under Xen PV, in which case I think part of the page tables may not be
walkable because they're owned by the hypervisor for its own use?
Notably the x86 version of ptdump_walk_pgd_level_core starts walking at
GUARD_HOLE_END_ADDR instead. See also
https://kernel.org/doc/html/latest/arch/x86/x86_64/mm.html for an ASCII
table reference on address space regions.
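Just to illustrate the kind of thing I mean (completely untested, and
PD_KERNEL_WALK_START is a made-up name for this sketch), the walk start
could come from something per-architecture instead of being hard-coded
to PAGE_OFFSET:

/*
 * Untested sketch: lowest kernel address that is safe to walk.
 * On x86-64, ptdump skips the Xen PV guard hole by starting at
 * GUARD_HOLE_END_ADDR; other architectures may need similar carve-outs.
 */
#ifdef CONFIG_X86_64
#define PD_KERNEL_WALK_START    GUARD_HOLE_END_ADDR
#else
#define PD_KERNEL_WALK_START    PAGE_OFFSET
#endif

and then the loop would become "for (s = PD_KERNEL_WALK_START; s != ~0ul; )".
I haven't checked what other architectures would need here.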
> +                        pr_info("Received a cancel signal from user, while scanning kernel mappings\n");
> +                        return -1;
> +                }
> +                cond_resched();
> +                s = e;
> +        }
> +
> +        if (!pr.vmalloc_maps) {
> +                pr_info("The page is not mapped into kernel vmalloc area\n");
> +        } else if (pr.vmalloc_maps > 1) {
> +                pr_info("The page is mapped into vmalloc area: %ld times\n",
> +                        pr.vmalloc_maps);
> +        }
> +
> +        if (!pr.direct_map)
> +                pr_info("The page is not mapped into kernel direct map\n");
> +
> +        pr_info("The page mapped into kernel page table: %ld times\n", pr.maps);
> +
> +        return pr.direct_map ? 1 : 0;
> +}
> +
> +/* Print kernel information about the pfn, return -1 if canceled by user */
> +static int page_detective_kernel(unsigned long pfn)
> +{
> +        unsigned long *mem = __va((pfn) << PAGE_SHIFT);
> +        unsigned long sum = 0;
> +        int direct_map;
> +        u64 s, e;
> +        int i;
> +
> +        s = sched_clock();
> +        direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem);
> +        e = sched_clock() - s;
> +        pr_info("Scanned kernel page table in [%llu.%09llus]\n",
> +                e / NSEC_PER_SEC, e % NSEC_PER_SEC);
> +
> +        /* Canceled by user or no direct map */
> +        if (direct_map < 1)
> +                return direct_map;
> +
> +        for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
> +                sum |= mem[i];

If the purpose of this interface is to inspect pages in weird states, I
wonder if it would make sense to use something like copy_mc_to_kernel()
in case that helps avoid kernel crashes due to uncorrectable 2-bit ECC
errors or such. But maybe that's not the kind of error you're concerned
about here? And I also don't have any idea if copy_mc_to_kernel()
actually does anything sensible for ECC errors. So don't treat this as a
fix suggestion, more as a random idea that should probably be ignored
unless someone who understands ECC errors says it makes sense.

But I think you should at least be using READ_ONCE(), since you're
reading from memory that can change concurrently.

> +        if (sum == 0)
> +                pr_info("The page contains only zeroes\n");
> +        else
> +                pr_info("The page contains some data\n");
> +
> +        return 0;
> +}

[...]

> +/*
> + * print information about mappings of pfn by mm, return -1 if canceled
> + * return number of mappings found.
> + */
> +static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn)
> +{
> +        struct pd_private_user pr = {0};
> +        unsigned long s, e;
> +
> +        pr.pfn = pfn;
> +        pr.mm = mm;
> +
> +        for (s = 0; s != TASK_SIZE; ) {

TASK_SIZE does not make sense when inspecting another task, because
TASK_SIZE depends on the virtual address space size of the current task
(whether you are a 32-bit or 64-bit process). Please use TASK_SIZE_MAX
for remote process access.

> +                e = s + PD_WALK_MAX_RANGE;
> +                if (e > TASK_SIZE || e < s)
> +                        e = TASK_SIZE;
> +
> +                if (mmap_read_lock_killable(mm)) {
> +                        pr_info("Received a cancel signal from user, while scanning user mappings\n");
> +                        return -1;
> +                }
> +                walk_page_range(mm, s, e, &pd_user_ops, &pr);
> +                mmap_read_unlock(mm);
> +                cond_resched();
> +                s = e;
> +        }
> +        return pr.maps;
> +}
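To make the READ_ONCE() and TASK_SIZE_MAX points concrete, this is
roughly what I have in mind (completely untested, just sketches of the
two hunks):

        /* in page_detective_kernel(): don't do plain loads of racy memory */
        for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
                sum |= READ_ONCE(mem[i]);

        /* in page_detective_user_mm_info(): bound the walk by TASK_SIZE_MAX */
        for (s = 0; s != TASK_SIZE_MAX; ) {
                e = s + PD_WALK_MAX_RANGE;
                if (e > TASK_SIZE_MAX || e < s)
                        e = TASK_SIZE_MAX;
                ...
        }

(READ_ONCE() of course won't help with ECC errors; it only keeps the
compiler from doing something surprising with a plain load that races
with concurrent writers.)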