Re: [PATCH makedumpfile] Fix incorrect PFN exclusion when LOAD segments overlap

Ming Wang <wangming01@xxxxxxxxxxx> · Wed, 9 Oct 2024 11:32:21 +0800

Hi Kazu,

Thank you for your insightful analysis and suggestion!

在 2024/10/9 09:47, HAGIO KAZUHITO(萩尾 一仁) 写道:
Hi Ming,

thank you for the detailed description.

On 2024/10/08 17:43, Ming Wang wrote:

On 2024/08/14 11:43, Ming Wang wrote:
When iterating through LOAD segments to exclude pages in
`exclude_nodata_pages`, simply using `phys_end` to track allocated
PFNs can lead to the erroneous exclusion of valid pages from a
subsequent LOAD segment if its `phys_start` overlaps with the
previous segment's address range.

This patch addresses the issue by checking for such overlaps with the
next LOAD segment's physical address range. If an overlap is
detected, we continue including PFNs from the current segment until
reaching the end of the overlapping region.

This fix ensures that all valid pages within overlapping LOAD segments
are correctly included during PFN exclusion.

I have a few questions:

- I heard from Masa that the overlap is caused by a bug, isn't it fixed
on the kernel side?  or it will be fixed but is this patch for existing
kernels that have the bug?
The overlap you mentioned is not caused by a kernel bug and is not expected to be fixed in the kernel.

I see.

- exclude_nodata_pages() is for dumpfiles created by "makedumpfile -E",
which uses p_filesz in program headers to record existing pages after
exclusion.  The function excludes pages only between
phys_start+file_size and phys_end, so if p_filesz == p_memsz, it will
not exclude any pages.
My system's page size is 16K, Consider the following LOAD segment information

phys_start         phys_end       virt_start         virt_end
LOAD[ 0]           200000          2186200 9000000000200000 9000000002186200
LOAD[ 1]           200000          ee00000 9000000000200000 900000000ee00000

I added some print statements within the exclude_nodata_pages function:

while (pfn < pfn_end) {
      ERRMSG("\nfunc:%s: pfn: %#llx phys_start: %#llx phys_end: %#llx file_size: %lu pfn_end:%#llx\n"
          , __func__, pfn, phys_start, phys_end, (unsigned long)file_size, pfn_end);
      clear_bit_on_2nd_bitmap(pfn, cycle);
      ++pfn;
}

The following output was observed:

func:exclude_nodata_pages: pfn: 0x861 phys_start: 0x200000 phys_end: 0x2186200 file_size: 33055232 pfn_end:0x862

After this print statement, clear_bit_on_2nd_bitmap is called, effectively clearing the bitmap bit for pfn: 0x861.
However, pfn: 0x861 is a valid PFN that happens to store the kernel's init_uts_ns. This leads to an error when the
crash tool attempts to detect the kernel version:

read_diskdump: PAGE_EXCLUDED: paddr/pfn: 2185d98/861
crash: page excluded: kernel virtual address: 9000000002185d98  type: "init_uts_ns"

So the calculated pfn:0x861 is wrong?
Yes, it's wrong.

LOAD           0x000000000000c000 0x9000000000200000 0x0000000000200000
                 0x0000000002550400 0x0000000002550400  RWE    0x0

This is a different example, it's expected that p_memsz equals to p_filesz.

But, the sizes are not aligned to the page size 16k, and

    pfn = paddr_to_pfn(phys_start + file_size);

it looks like this rounds down the last valid page..?  so

-               pfn = paddr_to_pfn(phys_start + file_size);
+               pfn = paddr_to_pfn(roundup(phys_start + file_size, PAGESIZE());
                  pfn_end = paddr_to_pfn(roundup(phys_end, PAGESIZE()));

How about this instead of your patch?
I agree that rounding up the calculated `pfn` to the nearest page 
boundary is a better approach to ensure that the last valid page of a 
LOAD segment is not accidentally excluded.

I've implemented your suggestion:

- pfn = paddr_to_pfn(phys_start + file_size);
+ pfn = paddr_to_pfn(roundup(phys_start + file_size, PAGESIZE()));

and tested it thoroughly. It effectively resolves the issue of valid 
pages being excluded due to rounding errors.

I've updated the patch accordingly and will be send v2 soon.

Thanks again for your help!

Best regards,
Ming

Thanks,
Kazu

I would like to check how they are on your machine, so could I have a
"readelf -l /proc/vmcore" output?
The output of the command readelf -l /proc/vmcore executed on my system is provided below.

It is important to note that this output corresponds to a different kernel version compared to the one used in the preceding example:

[root@localhost ~]# readelf -l /proc/vmcore

Elf file type is CORE (Core file)
Entry point 0x0
There are 8 program headers, starting at offset 64

Program Headers:
Type           Offset             VirtAddr           PhysAddr
FileSiz            MemSiz              Flags  Align
NOTE           0x0000000000004000 0x0000000000000000 0x0000000000000000
                 0x0000000000004a54 0x0000000000004a54         0x4
LOAD           0x000000000000c000 0x9000000000200000 0x0000000000200000
                 0x0000000002550400 0x0000000002550400  RWE    0x0
LOAD           0x0000000002560000 0x9000000000200000 0x0000000000200000
                 0x000000000ec00000 0x000000000ec00000  RWE    0x0
LOAD           0x0000000011160000 0x9000000090400000 0x0000000090400000
                 0x0000000011200000 0x0000000011200000  RWE    0x0
LOAD           0x0000000022360000 0x90000000f8e00000 0x00000000f8e00000
                 0x00000000000c0000 0x00000000000c0000  RWE    0x0
LOAD           0x0000000022420000 0x90000000f8ed0000 0x00000000f8ed0000
                 0x0000000004c70000 0x0000000004c70000  RWE    0x0
LOAD           0x0000000027090000 0x90000000fe100000 0x00000000fe100000
                 0x0000000f81f00000 0x0000000f81f00000  RWE    0x0
LOAD           0x0000000fa8f90000 0x9000100080000000 0x0000100080000000
                 0x0000001000000000 0x0000001000000000  RWE    0x0

- Do you mean that the bug always puts the two LOAD segments that have
the same phys_start in a row?
Yes, as demonstrated in the example above.

Thanks,
Ming

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec