Re: [REGRESSION] kexec does firmware reboot in kernel v6.7.6

"Kalra, Ashish" <ashish.kalra@xxxxxxx> · Thu, 24 Oct 2024 05:08:11 -0500

Hello David,

On 10/23/2024 8:50 AM, David Woodhouse wrote:
> On Wed, 2024-10-23 at 08:29 -0500, Kalra, Ashish wrote:
>>
>> On 10/23/2024 6:39 AM, David Woodhouse wrote:
>>> On Wed, 2024-10-23 at 06:07 -0500, Kalra, Ashish wrote:
>>>>
>>>> As mentioned above, about the same 2MB page containing the end portion of the RMP table and a page allocated for kexec and 
>>>> looking at the e820 memory map dump here: 
>>>>
>>>>>>> [    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
>>>>>>> [    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
>>>>
>>>> As seen here in the e820 memory map, the end range of the RMP table is not
>>>> aligned to 2MB and not reserved but it is usable as RAM.
>>>>
>>>> Subsequently, kexec-ed kernel could try to allocate from within that chunk
>>>> which then causes a fatal RMP fault.
>>>
>>> Well, allocating within that chunk would be just fine. It *is* usable
>>> as RAM, as the e820 table says. It works fine most of the time.
>>>
>>> You've missed a step out of the story. The problem is that for kexec we
>>> map it with an "overreaching" 2MiB PTE which also covers the reserved
>>> regions, and *that* is what causes the RMP violation fault.
>>>
>>
>> Actually, the RMP entry covering the end range of the RMP table will be a 2MB/large entry 
>> which means that the whole 2MB including the usable 1MB memory range here will also be marked
>> as reserved in the RMP table and hence any host writes into this memory range will trigger
>> the RMP violation.
> 
> Hm, that does not match our testing. We tried writing to the
> "offending" area from the main kernel (which I assume was using 4KiB
> pages for it, but didn't verify), and that was fine. 
> 
> It also doesn't match what Tom says in the email you linked to:
> 
> "There's no requirement from a hardware/RMP usage perspective that 
> requires a 2MB alignment, so BIOS is not doing anything wrong.  The 
> problem occurs because kexec is initially using 2MB mappings that 
> overlap the start and/or end of the RMP which then results in an RMP 
> fault when memory within one of those 2MB mappings, that is not part of
> the RMP, is referenced."
> 
> Tom's words precisely match my understanding of the situation (with the
> exception that he keeps saying 2MB when he means 2MiB).
> 
> I believe we *can* use that extra 1MiB which is marked as 'usable RAM'
> as usable RAM if we want to, as *long* as we don't use a 2MiB (or
> larger) PTE for it which would overlap the RMP table.
> 
> And the only case where the kernel uses an "overreaching" 2MiB mapping
> is the kexec identmap code, so we should just fix that.

Here is a more accurate explanation of the issue, after discussing it
with Tom:

The RMP entries for the RMP table memory are marked as firmware pages
(meaning they are assigned and immutable, with the key point being
that the assigned bit is set). If the start or end of the RMP table is
not 2MB aligned, then the RMP entries for that 2MB region are broken
down into 4k entries. If the kernel then uses a 2MB page table mapping
(the kexec identmap code uses a 2MiB PTE) to access the portion of
memory that is within the 2MB page but is not part of the RMP table,
an RMP fault is generated because the page size of the page table
mapping does not match the page size of the RMP entries.

This is documented in AMD64 Architecture Programmer's Manual Volume 2,
section 15.36.10 - RMP and VMPL Access Checks.

Because the RMP entry here covers both the RMP table and a usable memory
range, it has to be smashed into 4k entries.

So, the RMP fault is being generated here because of the page size mapping
mismatch between the RMP entry and the page table entry.
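
To make the arithmetic concrete, here is a rough sketch (Python, purely
illustrative, not kernel code) that takes the two e820 ranges from the
dump quoted above and checks which 2MiB frame the reserved/usable
boundary falls in. It assumes the reserved range ends where the RMP
table ends; the helper names are made up for this example.

```python
PMD_SIZE = 2 * 1024 * 1024  # size covered by one 2MiB page table entry

# Ranges copied from the e820 dump in this thread (inclusive end addresses).
# Assumption: the reserved range ends at the end of the RMP table.
RESERVED = (0xbfbe000000, 0xc1420fffff)
USABLE = (0xc142100000, 0xfc7fffffff)


def pmd_frame(addr):
    """Return the inclusive (start, end) of the 2MiB frame containing addr."""
    start = addr & ~(PMD_SIZE - 1)
    return start, start + PMD_SIZE - 1


def overlap_bytes(a, b):
    """Number of bytes shared by two inclusive ranges (0 if disjoint)."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return max(0, hi - lo + 1)


# The frame that a 2MiB mapping of the first usable byte would cover.
frame = pmd_frame(USABLE[0])

print(f"2MiB frame:        [{frame[0]:#x}-{frame[1]:#x}]")
print(f"reserved in frame: {overlap_bytes(frame, RESERVED):#x} bytes")
print(f"usable in frame:   {overlap_bytes(frame, USABLE):#x} bytes")
```

Under that assumption the frame is split into 1MiB of reserved memory
and 1MiB of usable memory, so a single 2MiB PTE for the usable half
necessarily covers the reserved half as well.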

> 
>>> We could take two possible viewpoints here. I was taking the viewpoint
>>> that this is a kernel bug, that it *shouldn't* be setting up 2MiB pages
>>> which include a reserved region, and should break those down to 4KiB
>>> pages.
>>>
>>> The alternative view would be to consider it a BIOS bug, and to say
>>> that the BIOS really *ought* to have reserved the whole 2MiB region to
>>> avoid the 'sharing'.  Since the hardware apparently already breaks down
>>> 1GiB pages to 2MiB TLB entries in order to avoid triggering the problem
>>> on 1GiB mappings.
>>>
>>>> This issue has been fixed with the following patch: 
>>>> https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot2@tip-bot2/
>>>
>>> Thanks for pointing that patch out! Should it have been Cc:stable?
>>>
>>
>> This thing can happen after SNP host support got merged in 6.11 and SNP support is enabled, therefore
>> the patch does not mark it Cc:stable.
>>
>> I am trying to understand the scenario here: you have SNP enabled in the BIOS and you also
>> have SNP support added in the host kernel, which means that the following logs are seen:
>> ..
>> SEV-SNP: RMP table physical range [0x000000xxxxxxxxxx - 0x000000yyyyyyyyyy]
>> ..
> 
> Ah yes. SEV-SNP isn't actually being *used* on these Genoa platforms at
> the moment, but I do think it's enabled in the kernel.
> 
> If this problem only happens when the kernel actually *enables* SEV-
> SNP, then it seems this fix was missed in our backporting of SEV-SNP
> support to, ahem, a slightly older kernel.
> 

Yes, this problem only happens when the kernel enables SEV-SNP support
and SNP_INIT_EX has been done by the CCP driver to initialize SEV-SNP.

> But I still don't like it :)
> 
>>> It seems to be taking the latter of the above two viewpoints, that this
>>> is a BIOS bug and that the BIOS *should* have reserved the whole 2MiB.
>>>
>>> In that case are fixed BIOSes available already? 
>>
>> We have been of the view that it is easier to get it fixed in kernel, by fixing/aligning the e820 range
>> mapping the start and end of RMP table to 2MB boundaries, rather than trusting a BIOS to do it
>> correctly. 
>>
>> Here is a link to a discussion on the same:
>> https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@xxxxxxx/
> 
> As noted above, that message clearly states that the BIOS isn't doing
> anything wrong, and the problem is the kernel using large page mappings
> that overlap reserved ranges.
> 
> In that case, shouldn't we fix the kernel *not* to do that?
> 
> I suppose we can be OK with "let's just avoid using that memory to
> workaround the kexec/identmap bug", but in that case let's not claim
> that we're working around a BIOS bug?
> 

Yes, this is the approach we are currently taking with the above patch,
to work around the kexec/identmap bug.

Do note that we *need* to do the e820 memory map fixups, and additionally
call memblock_reserve(), to ensure that the usable part of memory adjacent
to the RMP table does not get allocated to guests. Otherwise RMPUPDATE on
this range of memory fails. That is fixed with the following patch:

https://lore.kernel.org/lkml/172968164814.1442.8035313578482871705.tip-bot2@tip-bot2/
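
As a rough illustration of what "aligning to 2MB boundaries" means here
(a Python sketch under stated assumptions, not the code from either
patch), the ranges that need to be reserved in addition to the RMP table
itself can be computed as below. The RMP range used is an assumption
taken from the reserved e820 entry quoted earlier in this thread, and
the function names are hypothetical.

```python
PMD_SIZE = 2 * 1024 * 1024  # 2MiB


def align_down(addr, size):
    """Round addr down to a multiple of size (size must be a power of two)."""
    return addr & ~(size - 1)


def align_up(addr, size):
    """Round addr up to a multiple of size (size must be a power of two)."""
    return (addr + size - 1) & ~(size - 1)


def extra_ranges_to_reserve(rmp_start, rmp_end):
    """Return the (start, end) ranges, end exclusive, that lie in the same
    2MiB frames as the RMP table but outside the table itself."""
    extra = []
    lo = align_down(rmp_start, PMD_SIZE)
    hi = align_up(rmp_end, PMD_SIZE)
    if lo < rmp_start:
        extra.append((lo, rmp_start))
    if hi > rmp_end:
        extra.append((rmp_end, hi))
    return extra


# Assumed RMP range (end exclusive), taken from the reserved e820 entry;
# the real range is whatever the "SEV-SNP: RMP table physical range" log says.
RMP_START = 0xbfbe000000
RMP_END = 0xc142100000

for start, end in extra_ranges_to_reserve(RMP_START, RMP_END):
    print(f"also reserve [{start:#x}-{end - 1:#x}]")
```

With those assumed values the start is already 2MiB aligned and only the
1MiB following the end of the table needs to be taken away from usable
RAM.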

Thanks,
Ashish