Re: [RFCv1 4/6] misc/page_detective: Introduce Page Detective

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Nov 16, 2024 at 6:59 PM Pasha Tatashin
<pasha.tatashin@xxxxxxxxxx> wrote:
> Page Detective is a kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
>
> It operates through the Linux debugfs interface, providing access
> to both virtual and physical address inquiries. The output, presented
> via kernel log messages (accessible with dmesg), will help
> administrators and developers understand how specific pages are
> utilized by the system.
>
> This tool can be used to investigate various memory-related issues,
> such as checksum failures during live migration, filesystem journal
> failures, general segfaults, or other corruptions.
[...]
> +/*
> + * Walk kernel page table, and print all mappings to this pfn, return 1 if
> + * pfn is mapped in direct map, return 0 if not mapped in direct map, and
> + * return -1 if operation canceled by user.
> + */
> +static int page_detective_kernel_map_info(unsigned long pfn,
> +                                         unsigned long direct_map_addr)
> +{
> +       struct pd_private_kernel pr = {0};
> +       unsigned long s, e;
> +
> +       pr.direct_map_addr = direct_map_addr;
> +       pr.pfn = pfn;
> +
> +       for (s = PAGE_OFFSET; s != ~0ul; ) {
> +               e = s + PD_WALK_MAX_RANGE;
> +               if (e < s)
> +                       e = ~0ul;
> +
> +               if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) {

I think which parts of the kernel virtual address range you can safely
pagewalk is somewhat architecture-specific; for example, X86 can run
under Xen PV, in which case I think part of the page tables may not be
walkable because they're owned by the hypervisor for its own use?
Notably the x86 version of ptdump_walk_pgd_level_core starts walking
at GUARD_HOLE_END_ADDR instead.

See also https://kernel.org/doc/html/latest/arch/x86/x86_64/mm.html
for an ASCII table reference on address space regions.

> +                       pr_info("Received a cancel signal from user, while scanning kernel mappings\n");
> +                       return -1;
> +               }
> +               cond_resched();
> +               s = e;
> +       }
> +
> +       if (!pr.vmalloc_maps) {
> +               pr_info("The page is not mapped into kernel vmalloc area\n");
> +       } else if (pr.vmalloc_maps > 1) {
> +               pr_info("The page is mapped into vmalloc area: %ld times\n",
> +                       pr.vmalloc_maps);
> +       }
> +
> +       if (!pr.direct_map)
> +               pr_info("The page is not mapped into kernel direct map\n");
> +
> +       pr_info("The page mapped into kernel page table: %ld times\n", pr.maps);
> +
> +       return pr.direct_map ? 1 : 0;
> +}
> +
> +/* Print kernel information about the pfn, return -1 if canceled by user */
> +static int page_detective_kernel(unsigned long pfn)
> +{
> +       unsigned long *mem = __va((pfn) << PAGE_SHIFT);
> +       unsigned long sum = 0;
> +       int direct_map;
> +       u64 s, e;
> +       int i;
> +
> +       s = sched_clock();
> +       direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem);
> +       e = sched_clock() - s;
> +       pr_info("Scanned kernel page table in [%llu.%09llus]\n",
> +               e / NSEC_PER_SEC, e % NSEC_PER_SEC);
> +
> +       /* Canceled by user or no direct map */
> +       if (direct_map < 1)
> +               return direct_map;
> +
> +       for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
> +               sum |= mem[i];

If the purpose of this interface is to inspect pages in weird states,
I wonder if it would make sense to use something like
copy_mc_to_kernel() in case that helps avoid kernel crashes due to
uncorrectable 2-bit ECC errors or such. But maybe that's not the kind
of error you're concerned about here? And I also don't have any idea
if copy_mc_to_kernel() actually does anything sensible for ECC errors.
So don't treat this as a fix suggestion, more as a random idea that
should probably be ignored unless someone who understands ECC errors
says it makes sense.

But I think you should at least be using READ_ONCE(), since you're
reading from memory that can change concurrently.

> +       if (sum == 0)
> +               pr_info("The page contains only zeroes\n");
> +       else
> +               pr_info("The page contains some data\n");
> +
> +       return 0;
> +}
[...]
> +/*
> + * print information about mappings of pfn by mm, return -1 if canceled
> + * return number of mappings found.
> + */
> +static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn)
> +{
> +       struct pd_private_user pr = {0};
> +       unsigned long s, e;
> +
> +       pr.pfn = pfn;
> +       pr.mm = mm;
> +
> +       for (s = 0; s != TASK_SIZE; ) {

TASK_SIZE does not make sense when inspecting another task, because
TASK_SIZE depends on the virtual address space size of the current
task (whether you are a 32-bit or 64-bit process). Please use
TASK_SIZE_MAX for remote process access.

> +               e = s + PD_WALK_MAX_RANGE;
> +               if (e > TASK_SIZE || e < s)
> +                       e = TASK_SIZE;
> +
> +               if (mmap_read_lock_killable(mm)) {
> +                       pr_info("Received a cancel signal from user, while scanning user mappings\n");
> +                       return -1;
> +               }
> +               walk_page_range(mm, s, e, &pd_user_ops, &pr);
> +               mmap_read_unlock(mm);
> +               cond_resched();
> +               s = e;
> +       }
> +       return pr.maps;
> +}





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]     [Monitors]

  Powered by Linux