Hello Boris,
On 4/24/2024 9:48 AM, Borislav Petkov wrote:
On Mon, Apr 15, 2024 at 11:22:58PM +0000, Ashish Kalra wrote:
From: Ashish Kalra <ashish.kalra@xxxxxxx>
For kexec use case, need to use and stick to the EFI memmap passed
from the first kernel via boot-params/setup data, hence,
skip efi_arch_mem_reserve() during kexec.
Please use this or similar scheme when formulating your commit messages.
This above is too laconic.
1. Prepare the context for the explanation briefly.
2. Explain the problem at hand.
3. "It happens because of <...>"
4. "Fix it by doing X"
5. "(Potentially do Y)."
And some of those above are optional depending on the issue being
explained.
For more detailed info, see
Documentation/process/submitting-patches.rst,
Section "2) Describe your changes".
Here is the more detailed description of the issue:
With SNP guest kexec and during nested guest kexec, observe the
following efi memmap corruption :
[ 0.000000] efi: EFI v2.7 by EDK II^M
[ 0.000000] efi: SMBIOS=0x7e33f000 SMBIOS 3.0=0x7e33d000
ACPI=0x7e57e000 ACPI 2.0=0x7e57e014 MEMATTR=0x7cc3c018
Unaccepted=0x7c09e018 ^M
[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:^M
[ 0.000000] efi: mem03: [type=269370880|attr=0x0e42100e42180e41]
range=[0x0486200e41038c18-0x200e898a0eee713ac17] (invalid)^M
[ 0.000000] efi: mem04: [type=12336|attr=0x0e410686300e4105]
range=[0x100e420000000176-0x8c290f26248d200e175] (invalid)^M
[ 0.000000] efi: mem06: [type=1124304408|attr=0x000030b400000028]
range=[0x0e51300e45280e77-0xb44ed2142f460c1e76] (invalid)^M
[ 0.000000] efi: mem08: [type=68|attr=0x300e540583280e41]
range=[0x0000011affff3cd8-0x486200e54b38c0bcd7] (invalid)^M
[ 0.000000] efi: mem10: [type=1107529240|attr=0x0e42280e41300e41]
range=[0x300e41058c280e42-0x38010ae54c5c328ee41] (invalid)^M
[ 0.000000] efi: mem11: [type=189335566|attr=0x048d200e42038e18]
range=[0x0000318c00000048-0xe42029228ce4200047] (invalid)^M
[ 0.000000] efi: mem12: [type=239142534|attr=0x0000002400000b4b]
range=[0x0e41380e0a7d700e-0x80f26238f22bfe500d] (invalid)^M
[ 0.000000] efi: mem14: [type=239207055|attr=0x0e41300e43380e0a]
range=[0x8c280e42048d200e-0xc70b028f2f27cc0a00d] (invalid)^M
[ 0.000000] efi: mem15: [type=239210510|attr=0x00080e660b47080e]
range=[0x0000324c0000001c-0xa78028634ce490001b] (invalid)^M
[ 0.000000] efi: mem16: [type=4294848528|attr=0x0000329400000014]
range=[0x0e410286100e4100-0x80f252036a218f20ff] (invalid)^M
[ 0.000000] efi: mem19: [type=2250772033|attr=0x42180e42200e4328]
range=[0x41280e0ab9020683-0xe0e538c28b39e62682] (invalid)^M
[ 0.000000] efi: mem20: [type=16| | | | | | | | | | |WB|
|WC| ] range=[0x00000008ffff4438-0xffff44340090333c437] (invalid)^M
[ 0.000000] efi: mem22: [Reserved |attr=0x000000c1ffff4420]
range=[0xffff442400003398-0x1033a04240003f397] (invalid)^M
[ 0.000000] efi: mem23: [type=1141080856|attr=0x080e41100e43180e]
range=[0x280e66300e4b280e-0x440dc5ee7141f4c080d] (invalid)^M
[ 0.000000] efi: mem25: [Reserved |attr=0x0000000affff44a0]
range=[0xffff44a400003428-0x1034304a400013427] (invalid)^M
[ 0.000000] efi: mem28: [type=16| | | | | | | | | | |WB|
|WC| ] range=[0x0000000affff4488-0xffff448400b034bc487] (invalid)^M
[ 0.000000] efi: mem30: [Reserved |attr=0x0000000affff4470]
range=[0xffff447400003518-0x10352047400013517] (invalid)^M
[ 0.000000] efi: mem33: [type=16| | | | | | | | | | |WB|
|WC| ] range=[0x0000000affff4458-0xffff445400b035ac457] (invalid)^M
[ 0.000000] efi: mem35: [type=269372416|attr=0x0e42100e42180e41]
range=[0x0486200e44038c18-0x200e8b8a0eee823ac17] (invalid)^M
[ 0.000000] efi: mem37: [type=2351435330|attr=0x0e42100e42180e42]
range=[0x470783380e410686-0x2002b2a041c2141e685] (invalid)^M
[ 0.000000] efi: mem38: [type=1093668417|attr=0x100e420000000270]
range=[0x42100e42180e4220-0xfff366a4e421b78c21f] (invalid)^M
[ 0.000000] efi: mem39: [type=76357646|attr=0x180e42200e42280e]
range=[0x0e410686300e4105-0x4130f251a0710ae5104] (invalid)^M
[ 0.000000] efi: mem40: [type=940444268|attr=0x0e42200e42280e41]
range=[0x180e42200e42280e-0x300fc71c300b4f2480d] (invalid)^M
[ 0.000000] efi: mem41: [MMIO |attr=0x8c280e42048d200e]
range=[0xffff479400003728-0x42138e0c87820292727] (invalid)^M
[ 0.000000] efi: mem42: [type=1191674680|attr=0x0000004c0000000b]
range=[0x300e41380e0a0246-0x470b0f26238f22b8245] (invalid)^M
[ 0.000000] efi: mem43: [type=2010|attr=0x0301f00e4d078338]
range=[0x45038e180e42028f-0xe4556bf118f282528e] (invalid)^M
[ 0.000000] efi: mem44: [type=1109921345|attr=0x300e44000000006c]
range=[0x44080e42100e4218-0xfff39254e42138ac217] (invalid)^M
[ 0.000000] efi: mem45: [type=40|attr=0x0e41100e41180e0a]
range=[0x0000008affff5228-0x4702400e53b3830d227] (invalid)^M
[ 0.000000] efi: mem47: [type=1107529240|attr=0x42280e41300e4138]
range=[0x300e44058c280e42-0xe0d049a435c728ee41] (invalid)^M
...
This EFI memap corruption is happening during efi_arch_mem_reserve()
invocation with the previous kexec-ed kernel boot.
( efi_arch_mem_reserve() is invoked with the following call-stack: )
[ 0.310010] efi_arch_mem_reserve+0xb1/0x220^M
[ 0.310686] ? memblock_add_range+0x2a0/0x2e0^M
[ 0.311382] efi_mem_reserve+0x36/0x60^M
[ 0.311973] efi_bgrt_init+0x17d/0x1a0^M
[ 0.312565] ? __pfx_acpi_parse_bgrt+0x10/0x10^M
[ 0.313265] acpi_parse_bgrt+0x12/0x20^M
[ 0.313858] acpi_table_parse+0x77/0xd0^M
[ 0.314463] acpi_boot_init+0x362/0x630^M
[ 0.315069] setup_arch+0xa88/0xf80^M
[ 0.315629] start_kernel+0x68/0xa90^M
[ 0.316194] x86_64_start_reservations+0x1c/0x30^M
[ 0.316921] x86_64_start_kernel+0xbf/0x110^M
[ 0.317582] common_startup_64+0x13e/0x141^M
[ 0.318231] </TASK>^M
Now, efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory
for EFI memory map and due to early allocation it uses memblock allocation.
Later in the boot flow, efi_enter_virtual_mode() calls
kexec_enter_virtual_mode() in case of a kexec-ed kernel boot.
This function kexec_enter_virtual_mode() installs the new EFI memory map
by calling efi_memmap_init_late() which remaps the efi_memmap physically
allocated above in efi_arch_mem_reserve(), but please note that this
remapping is still using memblock allocation.
Subsequently, when memblock is freed later in boot flow, the above
remapped efi_memmap will have random corruption (similar to a
use-after-free scenario).
This corrupted EFI memory map is then passed to the next kexec-ed kernel
which causes a panic when trying to use the corrupted EFI memory map.
Additionally during SNP guest kexec testing discovered that EFI memmap
is corrupted during chained kexec.
That sentence needs sanitization.
kexec_enter_virtual_mode() during late init will remap the efi_memmap
physical pages allocated in efi_arch_mem_reserve() via memblock & then
s/&/and/
This is not code. Please take a greater care when writing commit
messages - they're not write-only.
subsequently cause random EFI memmap corruption once memblock is
freed/teared-down.
"torn down"
Suggested-by: Dave Young <dyoung@xxxxxxxxxx>
[Dave Young: checking the md attribute instead of checking the efi_setup]
Signed-off-by: Ashish Kalra <ashish.kalra@xxxxxxx>
---
arch/x86/platform/efi/quirks.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..982f5e50a4b3 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -258,12 +258,28 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
int num_entries;
void *new;
- if (efi_mem_desc_lookup(addr, &md) ||
- md.type != EFI_BOOT_SERVICES_DATA) {
+ /*
+ * For kexec use case, we need to use the EFI memmap passed from the first
Make all your text impersonal - no "we", "I", etc.
+ * kernel via setup data, so we need to skip this.
What exactly do we need to skip?
If the EFI memory descriptor lookup fails?
+ * Additionally kexec_enter_virtual_mode() during late init will remap
+ * the efi_memmap physical pages allocated here via memboot & then
+ * subsequently cause random EFI memmap corruption once memblock is freed.
+ */
Why is that comment here and what is its relevance to the line it is
above of?
+ if (efi_mem_desc_lookup(addr, &md)) {
pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}
+ if (md.type != EFI_BOOT_SERVICES_DATA) {
+ pr_err("Skip reserving non EFI Boot Service Data memory for %pa\n", &addr);
What is this pr_err() useful for?
+ return;
+ }
+
+ /* Kexec copied the efi memmap from the first kernel, thus skip the case */
kexec? This is a generic function - what does it have to do with kexec?
The subject of this patch is:
Subject: [PATCH v5 1/3] efi/x86: skip efi_arch_mem_reserve() in case of kexec
and yet, nothing skips this function - it adds a bunch of checks,
printks and early returns with the intent that those early returns
happen on kexec and thus the actual memremap doesn't happen there.
So it is some sort of: let's check things which will be true in
a kexec-ed kernel and thus avoid the function by returning early.
But I have no clue.
It sounds to me like you need to go back up, to the 10000ft view and
explain how exactly this efi_mem_reserve() causes trouble for the
kexec-ed kernel so that we can think of a proper solution, not some
random hackery.
The above details explain why and how efi_arch_mem_reserve() causes
trouble for the (nested) kexec-ed kernel, additionally, there is a
another reason to skip efi_arch_mem_reserve() altogether for the kexec
case, as for kexec use case we need to use the EFI memmap passed from
the 1st kernel via setup_data and probably need to avoid any additional
EFI memory map additions/updates.
Therefore, the first revision of this patch had the following code to
skip efi_arch_mem_reserve():
void __init efi_arch_mem_reserve(..) {
+ if (efi_setup) + return;
But then based on upstream review/feedback, the second revision of this
patch, updated the patch to check the md attribute of the EFI memory
descriptor instead of checking for efi_setup for detecting if running
under kexec kernel and the checking of the md attribute of the EFI
memory descriptor introduces these additional checks and pr_err() which
you commented on above.
Hopefully, the above detailed explanation captures the reason to skip
efi_arch_mem_reserve() in case of (SNP) guest kexec, looking forward to
your comments/feedback on the same for me to rework this patch
(especially the commit message) and post it again.
Thanks, Ashish
_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec