Hi Ard, Akashi, On Wed, Dec 13, 2017 at 5:47 PM, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: > On 13 December 2017 at 12:16, AKASHI Takahiro > <takahiro.akashi@xxxxxxxxxx> wrote: >> On Wed, Dec 13, 2017 at 10:49:27AM +0000, Ard Biesheuvel wrote: >>> On 13 December 2017 at 10:26, AKASHI Takahiro >>> <takahiro.akashi@xxxxxxxxxx> wrote: >>> > Bhupesh, Ard, >>> > >>> > On Wed, Dec 13, 2017 at 03:21:59AM +0530, Bhupesh Sharma wrote: >>> >> Hi Ard, Akashi >>> >> >>> > (snip) >>> > >>> >> Looking deeper into the issue, since the arm64 kexec-tools uses the >>> >> 'linux,usable-memory-range' dt property to allow crash dump kernel to >>> >> identify its own usable memory and exclude, at its boot time, any >>> >> other memory areas that are part of the panicked kernel's memory. >>> >> (see https://www.kernel.org/doc/Documentation/devicetree/bindings/chosen.txt >>> >> , for details) >>> > >>> > Right. >>> > >>> >> 1). Now when 'kexec -p' is executed, this node is patched up only >>> >> with the crashkernel memory range: >>> >> >>> >> /* add linux,usable-memory-range */ >>> >> nodeoffset = fdt_path_offset(new_buf, "/chosen"); >>> >> result = fdt_setprop_range(new_buf, nodeoffset, >>> >> PROP_USABLE_MEM_RANGE, &crash_reserved_mem, >>> >> address_cells, size_cells); >>> >> >>> >> (see https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/arm64/kexec-arm64.c#n465 >>> >> , for details) >>> >> >>> >> 2). This excludes the ACPI reclaim regions irrespective of whether >>> >> they are marked as System RAM or as RESERVED. As, >>> >> 'linux,usable-memory-range' dt node is patched up only with >>> >> 'crash_reserved_mem' and not 'system_memory_ranges' >>> >> >>> >> 3). As a result when the crashkernel boots up it doesn't find this >>> >> ACPI memory and crashes while trying to access the same: >>> >> >>> >> # kexec -p /boot/vmlinuz-`uname -r` --initrd=/boot/initramfs-`uname >>> >> -r`.img --reuse-cmdline -d >>> >> >>> >> [snip..] >>> >> >>> >> Reserved memory range >>> >> 000000000e800000-000000002e7fffff (0) >>> >> >>> >> Coredump memory ranges >>> >> 0000000000000000-000000000e7fffff (0) >>> >> 000000002e800000-000000003961ffff (0) >>> >> 0000000039d40000-000000003ed2ffff (0) >>> >> 000000003ed60000-000000003fbfffff (0) >>> >> 0000001040000000-0000001ffbffffff (0) >>> >> 0000002000000000-0000002ffbffffff (0) >>> >> 0000009000000000-0000009ffbffffff (0) >>> >> 000000a000000000-000000affbffffff (0) >>> >> >>> >> 4). So if we revert Ard's patch or just comment the fixing up of the >>> >> memory cap'ing passed to the crash kernel inside >>> >> 'arch/arm64/mm/init.c' (see below): >>> >> >>> >> static void __init fdt_enforce_memory_region(void) >>> >> { >>> >> struct memblock_region reg = { >>> >> .size = 0, >>> >> }; >>> >> >>> >> of_scan_flat_dt(early_init_dt_scan_usablemem, ®); >>> >> >>> >> if (reg.size) >>> >> //memblock_cap_memory_range(reg.base, reg.size); /* >>> >> comment this out */ >>> >> } >>> > >>> > Please just don't do that. It can cause a fatal damage on >>> > memory contents of the *crashed* kernel. >>> > >>> >> 5). Both the above temporary solutions fix the problem. >>> >> >>> >> 6). However exposing all System RAM regions to the crashkernel is not >>> >> advisable and may cause the crashkernel or some crashkernel drivers to >>> >> fail. >>> >> >>> >> 6a). I am trying an approach now, where the ACPI reclaim regions are >>> >> added to '/proc/iomem' separately as ACPI reclaim regions by the >>> >> kernel code and on the other hand the user-space 'kexec-tools' will >>> >> pick up the ACPI reclaim regions from '/proc/iomem' and add it to the >>> >> dt node 'linux,usable-memory-range' >>> > >>> > I still don't understand why we need to carry over the information >>> > about "ACPI Reclaim memory" to crash dump kernel. In my understandings, >>> > such regions are free to be reused by the kernel after some point of >>> > initialization. Why does crash dump kernel need to know about them? >>> > >>> >>> Not really. According to the UEFI spec, they can be reclaimed after >>> the OS has initialized, i.e., when it has consumed the ACPI tables and >>> no longer needs them. Of course, in order to be able to boot a kexec >>> kernel, those regions needs to be preserved, which is why they are >>> memblock_reserve()'d now. >> >> For my better understandings, who is actually accessing such regions >> during boot time, uefi itself or efistub? >> > > No, only the kernel. This is where the ACPI tables are stored. For > instance, on QEMU we have > > ACPI: RSDP 0x0000000078980000 000024 (v02 BOCHS ) > ACPI: XSDT 0x0000000078970000 000054 (v01 BOCHS BXPCFACP 00000001 > 01000013) > ACPI: FACP 0x0000000078930000 00010C (v05 BOCHS BXPCFACP 00000001 > BXPC 00000001) > ACPI: DSDT 0x0000000078940000 0011DA (v02 BOCHS BXPCDSDT 00000001 > BXPC 00000001) > ACPI: APIC 0x0000000078920000 000140 (v03 BOCHS BXPCAPIC 00000001 > BXPC 00000001) > ACPI: GTDT 0x0000000078910000 000060 (v02 BOCHS BXPCGTDT 00000001 > BXPC 00000001) > ACPI: MCFG 0x0000000078900000 00003C (v01 BOCHS BXPCMCFG 00000001 > BXPC 00000001) > ACPI: SPCR 0x00000000788F0000 000050 (v02 BOCHS BXPCSPCR 00000001 > BXPC 00000001) > ACPI: IORT 0x00000000788E0000 00007C (v00 BOCHS BXPCIORT 00000001 > BXPC 00000001) > > covered by > > efi: 0x0000788e0000-0x00007894ffff [ACPI Reclaim Memory ...] > ... > efi: 0x000078970000-0x00007898ffff [ACPI Reclaim Memory ...] > > >>> So it seems that kexec does not honour the memblock_reserve() table >>> when booting the next kernel. >> >> not really. >> >>> > (In other words, can or should we skip some part of ACPI-related init code >>> > on crash dump kernel?) >>> > >>> >>> I don't think so. And the change to the handling of ACPI reclaim >>> regions only revealed the bug, not created it (given that other >>> memblock_reserve regions may be affected as well) >> >> As whether we should honor such reserved regions over kexec'ing >> depends on each one's specific nature, we will have to take care one-by-one. >> As a matter of fact, no information about "reserved" memblocks is >> exposed to user space (via proc/iomem). >> > > That is why I suggested (somewhere in this thread?) to not expose them > as 'System RAM'. Do you think that could solve this? I agree. So how about my proposal (please see my last reply) - to expose these regions as "ACPI reclaim regions" in /proc/iomem. Please note that we already have several instances where the driver regions are already explicitly labelled by different concise names across /proc/iomem, for e.g.: # cat /proc/iomem | grep -i serial 1c021000-1c02101f : serial If we expose only the ACPI reclaim regions to the crashkernel (along with the normal crash kernel memory range), we avoid exposing all System RAM or reserved regions to the crashkernel which may cause issues with crashkernel boot or crash coredump save operations. And we can also accordingly modify the 'kexec-tools' to pick these regions along with the normal crash kernel memory range and append them to the 'linux,usable-memory-range' dt node, so that the crash kernel can operate on them. If you think this ok, I can try to send a RFC patch later this week. Please let me know. Regards, Bhupesh >>> >>> >> 6b). The kernel code currently looks like the following: >>> >> >>> >> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c >>> >> index 30ad2f085d1f..867bdec7c692 100644 >>> >> --- a/arch/arm64/kernel/setup.c >>> >> +++ b/arch/arm64/kernel/setup.c >>> >> @@ -206,6 +206,7 @@ static void __init request_standard_resources(void) >>> >> { >>> >> struct memblock_region *region; >>> >> struct resource *res; >>> >> + phys_addr_t addr_start, addr_end; >>> >> >>> >> kernel_code.start = __pa_symbol(_text); >>> >> kernel_code.end = __pa_symbol(__init_begin - 1); >>> >> @@ -218,9 +219,17 @@ static void __init request_standard_resources(void) >>> >> res->name = "reserved"; >>> >> res->flags = IORESOURCE_MEM; >>> >> } else { >>> >> - res->name = "System RAM"; >>> >> - res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; >>> >> + addr_start = >>> >> __pfn_to_phys(memblock_region_reserved_base_pfn(region)); >>> >> + addr_end = >>> >> __pfn_to_phys(memblock_region_reserved_end_pfn(region)) - 1; >>> >> + if ((efi_mem_type(addr_start) == EFI_ACPI_RECLAIM_MEMORY) >>> >> || (efi_mem_type(addr_end) == EFI_ACPI_RECLAIM_MEMORY)) { >>> >> + res->name = "ACPI reclaim region"; >>> >> + res->flags = IORESOURCE_MEM; >>> >> + } else { >>> >> + res->name = "System RAM"; >>> >> + res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; >>> >> + } >>> >> } >>> >> + >>> >> res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region)); >>> >> res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1; >>> >> >>> >> @@ -292,6 +301,7 @@ void __init setup_arch(char **cmdline_p) >>> >> >>> >> request_standard_resources(); >>> >> >>> >> + efi_memmap_unmap(); >>> >> early_ioremap_reset(); >>> >> >>> >> if (acpi_disabled) >>> >> diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c >>> >> index 80d1a885def5..a7c522eac640 100644 >>> >> --- a/drivers/firmware/efi/arm-init.c >>> >> +++ b/drivers/firmware/efi/arm-init.c >>> >> @@ -259,7 +259,6 @@ void __init efi_init(void) >>> >> >>> >> reserve_regions(); >>> >> efi_esrt_init(); >>> >> - efi_memmap_unmap(); >>> >> >>> >> memblock_reserve(params.mmap & PAGE_MASK, >>> >> PAGE_ALIGN(params.mmap_size + >>> >> >>> >> >>> >> After this change the ACPI reclaim regions are properly recognized in >>> >> '/proc/iomem': >>> >> >>> >> # cat /proc/iomem | grep -i ACPI >>> >> 396c0000-3975ffff : ACPI reclaim region >>> >> 39770000-397affff : ACPI reclaim region >>> >> 398a0000-398bffff : ACPI reclaim region >>> >> >>> >> 6c). I am currently changing the 'kexec-tools' and will finish the >>> >> testing over the next few days. >>> >> >>> >> I just wanted to know your opinion on this issue, so that I will be >>> >> able to propose a fix on the above lines. >>> >> >>> >> Also Cc'ing kexec mailing list for more inputs on changes proposed to >>> >> kexec-tools. >>> >> >>> >> Thanks, >>> >> Bhupesh -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html