lijiang <lijiang@xxxxxxxxxx> writes: > 在 2019年10月15日 19:04, Eric W. Biederman 写道: >> lijiang <lijiang@xxxxxxxxxx> writes: >> >>> 在 2019年10月13日 11:54, Eric W. Biederman 写道: >>>> Dave Young <dyoung@xxxxxxxxxx> writes: >>>> >>>>> Hi Eric, >>>>> >>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>>>>> Lianbo Jiang <lijiang@xxxxxxxxxx> writes: >>>>>> >>>>>>> When the crashkernel kernel command line option is specified, the >>>>>>> low 1MiB memory will always be reserved, which makes that the memory >>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>>>>> necessary to create a backup region and also no need to copy the first >>>>>>> 640k content to a backup region. >>>>>>> >>>>>>> Currently, the code related to the backup region can be safely removed, >>>>>>> so lets clean up. >>>>>>> >>>>>>> Signed-off-by: Lianbo Jiang <lijiang@xxxxxxxxxx> >>>>>>> --- >>>>>> >>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>>>>> index eb651fbde92a..cc5774fc84c0 100644 >>>>>>> --- a/arch/x86/kernel/crash.c >>>>>>> +++ b/arch/x86/kernel/crash.c >>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>>>>> >>>>>>> #ifdef CONFIG_KEXEC_FILE >>>>>>> >>>>>>> -static unsigned long crash_zero_bytes; >>>>>>> - >>>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>>>>> { >>>>>>> unsigned int *nr_ranges = arg; >>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>>>>> { >>>>>>> struct crash_mem *cmem = arg; >>>>>>> >>>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>>> - cmem->nr_ranges++; >>>>>>> + if (res->start >= SZ_1M) { >>>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>>> + cmem->nr_ranges++; >>>>>>> + } else if (res->end > SZ_1M) { >>>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>>> + cmem->nr_ranges++; >>>>>>> + } >>>>>> >>>>>> What is going on with this chunk? I can guess but this needs a clear >>>>>> comment. >>>>> >>>>> Indeed it needs some code comment, this is based on some offline >>>>> discussion. cat /proc/vmcore will give a warning because ioremap is >>>>> mapping the system ram. >>>>> >>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >>>>> kernel can use the low 1M memory because for example the trampoline >>>>> code. >>>>> >>>>>> >>>>>>> >>>>>>> return 0; >>>>>>> } >>>>>> >>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >>>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data)); >>>>>>> cmd.params = params; >>>>>>> >>>>>>> - /* Add first 640K segment */ >>>>>>> - ei.addr = image->arch.backup_src_start; >>>>>>> - ei.size = image->arch.backup_src_sz; >>>>>>> + /* >>>>>>> + * Add the low memory range[0x1000, SZ_1M], skip >>>>>>> + * the first zero page. >>>>>>> + */ >>>>>>> + ei.addr = PAGE_SIZE; >>>>>>> + ei.size = SZ_1M - PAGE_SIZE; >>>>>>> ei.type = E820_TYPE_RAM; >>>>>>> add_e820_entry(params, &ei); >>>>>> >>>>>> Likewise here. Why do we need a special case? >>>>>> Why the magic with PAGE_SIZE? >>>>> >>>>> Good catch, the zero page part is useless, I think no other special >>>>> reason, just assumed zero page is not usable, but it should be ok to >>>>> remove the special handling, just pass 0 - 1M is good enough. >>>> >>>> But if we have stopped special casing the low 1M. Why do we need a >>>> special case here at all? >>>> >>> Here, need to pass the low memory range to kdump kernel, which will guarantee >>> the availability of low memory in kdump kernel, otherwise, kdump kernel won't >>> use the low memory region. >>> >>>> If you need the special case it is almost certainly wrong to say you >>>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video >>>> MMIO area. >>>> >>>> There is a reason the original code said 640KiB. >>>> >>> Do you mean that the 640k region is good enough here instead of 1MiB? >> >> Reading through the code of crash_setup_memap_entries I see that what >> the code is doing now. The code is repeating the e820 memory map with >> the memory areas that were not reserved for the crash kernel removed. >> >> In which case what the code needs to be doing something like: >> >> cmd.type = E820_TYPE_RAM; >> flags = IORESOURCE_MEM; >> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd, >> memmap_entry_callback); >> > The above code does not get the results what we expected, it gets the reserved > memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range. > > Finally, kdump kernel happened the panic as follow: > ...... > [ 3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated > [ 3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4 > [ 3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018 > [ 3.556660] Call Trace: > [ 3.556660] dump_stack+0x46/0x60 > [ 3.556660] panic+0xfb/0x2d7 > [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f > [ 3.556660] init_real_mode+0x27/0x1fa > [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f > [ 3.556660] ? do_one_initcall+0x46/0x1e4 > [ 3.556660] ? proc_register+0xd0/0x130 > [ 3.556660] ? kernel_init_freeable+0xe2/0x242 > [ 3.556660] ? rest_init+0xaa/0xaa > [ 3.556660] ? kernel_init+0xa/0x106 > [ 3.556660] ? ret_from_fork+0x22/0x40 > [ 3.556660] Rebooting in 10 seconds.. > [ 3.556660] ACPI MEMORY or I/O RESET_REG. > > I modified the above code, and tested it. This can find out the system ram in > the low 1MiB range. And it worked well. > > @@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) > memset(&cmd, 0, sizeof(struct crash_memmap_data)); > cmd.params = params; > > + /* Add the low 1MiB */ > + cmd.type = E820_TYPE_RAM; > + flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; > + walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd, > + memmap_entry_callback); > That looks like a very reasonable fix. >> Depending on which bugs exist it might make sense to limit this to >> the low 640KiB. But finding something the kernel already recognizes >> as RAM should prevent most of those problems already. Barring bugs >> I admit it doesn't make sense to repeat the work that someone else >> has already done. >> >> This bit: >> /* Add e820 reserved ranges */ >> cmd.type = E820_TYPE_RESERVED; >> flags = IORESOURCE_MEM; >> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd, >> memmap_entry_callback); >> >> Should probably start at 1MiB instead of 0. Just so we don't report the > If so, it can not find out the reserved memory marked as 'IORES_DESC_RESERVED' in > the low 1MiB range, finally, it doesn't pass the reserved memory in the low 1MiB to > kdump kernel, which could cause some problems, such as SME or PCI MMCONFIG issue. Good point. For some reason I was thinking IORESOURCE_MEM and IORESOURCE_SYSTEM_RAM were the same thing. It has been way to long since I have been in that part of the code. So yes let's leave that part alone. >> memory below 1MiB as unconditionally reserved. I don't properly >> understand the IORES_DESC_RESERVED flag, and how that differs from > I found three commits about 'IORES_DESC_RESERVED' flag, hope this helps. > 1.ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED") > 2.5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination") > 3.980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table") > >> flags. So please test my suggestions to verify the code works as >> expected. >> > I have tested the two changes that you mentioned, please refer to the > reply above. Thank you. It looks like you have figured out how these things should work. Eric _______________________________________________ kexec mailing list kexec@xxxxxxxxxxxxxxxxxxx http://lists.infradead.org/mailman/listinfo/kexec