On 10/09/2024 15:26, Eric W. Biederman wrote:
> Breno Leitao <leitao@xxxxxxxxxx> writes:
>
>> We've seen a problem in upstream kernel kexec, where an EFI TPM log event table
>> is being overwritten. This problem happens on real machines, as well as in a
>> recent EDK2 qemu VM.
>>
>> Digging deep, the table is being overwritten during kexec, more precisely when
>> relocating the kernel (relocate_kernel() function).
>>
>> I've also found that the table is being properly reserved using
>> memblock_reserve() early in the boot, and that range gets overwritten later
>> by relocate_kernel(). In other words, kexec is overwriting memory that was
>> previously reserved (with memblock_reserve()).
>>
>> Usama found that kexec only honours memory reservations from /sys/firmware/memmap
>> which comes from e820_table_firmware table.
>>
>> Looking at the TPM spec, I found the following part:
>>
>>   If the ACPI TPM2 table contains the address and size of the Platform Firmware TCG log,
>>   firmware “pins” the memory associated with the Platform Firmware TCG log, and reports
>>   this memory as “Reserved” memory via the INT 15h/E820 interface.
>>
>> From: https://trustedcomputinggroup.org/wp-content/uploads/PC-ClientPlatform_Profile_for_TPM_2p0_Systems_v49_161114_public-review.pdf
>>
>> I am wondering if that memory region/range should be part of the e820 table that is
>> passed by EFI firmware to the kernel, and if it is not passed (as it is not being
>> passed today), then the kernel doesn't need to respect it, and it is free to
>> overwrite it (as it does today). In other words, this is a firmware bug and not a
>> kernel bug.
>>
>> Am I missing something?
>
> I agree that this appears to be a firmware bug. This memory is reserved
> in one location and not in another location.
>
> That said that doesn't mean we can't deal with it in the kernel.
> acpi_table_upgrade seems to have hit a similar issue and calls
> arch_reserve_mem_area to reserve the area in the e820tables.
>
> The last time I looked the e820 tables (in the kernel) are used to store
> the efi memory map when available and only use the true e820 data on
> older systems.
>
> Which is a long way of saying that the e820 table in the kernel last I
> looked was the master table, of how the firmware views the memory.
>
> As I recall the memblock allocator is the bootstrap memory allocator
> used when bringing up the kernel. So I don't see reserving something
> in the memblock allocator as being authoritative as to how the firmware
> has set up memory.
>
> I would suggest writing a patch to update whatever is calling
> memblock_reserve to also, or perhaps in preference to, update the e820
> map. If the code is not x86 specific I would suggest using ACPI's
> arch_reserve_mem_area call.
>

Thanks, I have sent a potential fix for this at
https://lore.kernel.org/all/20240911104109.1831501-1-usamaarif642@xxxxxxxxx/

We can see this issue in kernels going all the way back to 5.12. Up until
now it only corrupted the tpm_log version, so it wasn't really an issue.
After upgrading production to 6.9, the tpm_log size has started to get
corrupted as well. When the size is corrupted to a negative value, the
memblock_reserve in efi_tpm_eventlog_init reserves all of the available
memory and the system OOMs at boot time, which is a serious issue.

It would be good to know if the above patch is an acceptable fix.

Thanks!
Usama