Re: [REGRESSION] kexec does firmware reboot in kernel v6.7.6

"Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> · Wed, 13 Mar 2024 07:16:23 -0500

Steve Wahl <steve.wahl@xxxxxxx> writes:

> [*really* added kexec maintainers this time.]
>
> Full thread starts here:
> https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@xxxxxxxxxxxxxxx/
>
> On Wed, Mar 13, 2024 at 12:12:31AM +0530, Pavin Joseph wrote:
>> On 3/12/24 20:43, Steve Wahl wrote:
>> > But I don't want to introduce a new command line parameter if the
>> > actual problem can be understood and fixed.  The question is how much
>> > time do I have to pursue a direct fix before some other action needs
>> > to be taken?
>> 
>> Perhaps the kexec maintainers [0] can be made aware of this and you could
>> coordinate with them on a potential fix?
>> 
>> Currently maintained by
>> P:      Simon Horman
>> M:      horms@xxxxxxxxxxxx
>> L:      kexec@xxxxxxxxxxxxxxxxxxx
>
> Probably a good idea to add kexec people to the list, so I've added
> them to this email.
>
> Everyone, my recent patch to the kernel that changed identity mapping:
>
> 7143c5f4cf2073193 x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
>
> ... has broken kexec on a few machines.  The symptom is they do a full
> BIOS reboot instead of a kexec of the new kernel.  Seems to be limited
> to AMD processors, but it's not all AMD processors, probably just some
> characteristic that they happen to share.
>
> The same machines that are broken by my patch, are also broken in
> previous kernels if you add "nogbpages" to the kernel command line
> (which makes the identity map bigger, "nogbpages" doing for all parts
> of the identity map what my patch does only for some parts of it).
>
> I'm still hoping to find a machine I can reproduce this on to try and
> debug it myself.
>
> If any of you have any assistance or advice to offer, it would be most
> welcome!

Kexec happens on identity mapped page tables.

The files of interest are machine_kexec_64.c and relocate_kernel_64.S.

I suspect either the building of the identity mapped page table in
machine_kexec_prepare, or the switching to the page table in
identity_mapped in relocate_kernel_64.S is where something goes wrong.

Probably in kernel_ident_mapping_init as that code is directly used
to build the identity mapped page tables.

Hmm.

Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only
where full GB page should be mapped.")

Given the simplicity of that change itself, my guess is that somewhere in
the first 1GiB there are pages that need to be mapped, like the IDT at 0,
that are not getting mapped.

Reading through the changelog:
>   x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
>    
>    When ident_pud_init() uses only gbpages to create identity maps, large
>    ranges of addresses not actually requested can be included in the
>    resulting table; a 4K request will map a full GB.  On UV systems, this
>    ends up including regions that will cause hardware to halt the system
>    if accessed (these are marked "reserved" by BIOS).  Even processor
>    speculation into these regions is enough to trigger the system halt.
>    
>    Only use gbpages when map creation requests include the full GB page
>    of space.  Fall back to using smaller 2M pages when only portions of a
>    GB page are included in the request.
>    
>    No attempt is made to coalesce mapping requests. If a request requires
>    a map entry at the 2M (pmd) level, subsequent mapping requests within
>    the same 1G region will also be at the pmd level, even if adjacent or
>    overlapping such requests could have been combined to map a full
>    gbpage.  Existing usage starts with larger regions and then adds
>    smaller regions, so this should not have any great consequence.
>    
>    [ dhansen: fix up comment formatting, simplifty changelog ]
>    
>    Signed-off-by: Steve Wahl <steve.wahl@xxxxxxx>
>    Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>    Cc: stable@xxxxxxxxxxxxxxx
>    Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
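
To make the rule in that changelog concrete, here is a rough userspace
sketch of the decision for a single 1GiB slot. It is not the kernel's
code: would_use_gbpage() is a hypothetical helper, and PUD_SIZE and
PUD_MASK are redefined locally on the assumption that one pud entry
covers 1GiB.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: one pud entry maps 1 GiB, as on x86-64 with 4-level paging. */
#define PUD_SIZE (1ULL << 30)
#define PUD_MASK (~(PUD_SIZE - 1))

/*
 * Hypothetical helper, for illustration only.
 * For the 1 GiB slot containing addr, decide whether the request
 * [addr, end) covers that whole slot. Only then would a gbpage map
 * nothing beyond what was asked for.
 */
static bool would_use_gbpage(uint64_t addr, uint64_t end)
{
    /* End of the current slot, clamped to the end of the request. */
    uint64_t next = (addr & PUD_MASK) + PUD_SIZE;

    if (next > end)
        next = end;

    /* Both the start and the clamped end must sit on a 1 GiB boundary. */
    return (addr & ~PUD_MASK) == 0 && (next & ~PUD_MASK) == 0;
}
```

With this rule a 4K request at address 0 is refused a gbpage and falls
back to 2M pages, while a request for exactly [0, 1GiB) still gets one.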

I know that historically the fixed MTRRs were used so that the first 1GiB
could be covered with page tables without causing problems.

I suspect that, whatever those UV systems are, a more targeted solution
would be to use the fixed MTRRs to disable caching and speculation on the
problematic ranges, rather than completely changing the fundamental logic
of how pages are mapped.

Right now it looks like you get to play a game of whack-a-mole with
firmware/BIOS tables that don't mention something important, and
ensuring the kernel maps everything important in the first 1GiB.
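
A rough sketch of why this turns into whack-a-mole, under the assumption
that each request is simply rounded out to the page size in use. The
struct range type and both helpers are hypothetical and only model which
addresses end up covered; they are not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PMD_SIZE (1ULL << 21)            /* 2 MiB page */
#define PMD_MASK (~(PMD_SIZE - 1))
#define PUD_SIZE (1ULL << 30)            /* 1 GiB page */
#define PUD_MASK (~(PUD_SIZE - 1))

/* Hypothetical mapping request covering [start, end). */
struct range {
    uint64_t start;
    uint64_t end;
};

/* Is probe covered when every request is rounded out to page_size? */
static bool mapped_at(const struct range *req, size_t n, uint64_t probe,
                      uint64_t page_size, uint64_t page_mask)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = req[i].start & page_mask;
        uint64_t hi = (req[i].end + page_size - 1) & page_mask;

        if (probe >= lo && probe < hi)
            return true;
    }
    return false;
}

/* Coverage when partial requests fall back to 2 MiB pages. */
static bool mapped_with_2m_pages(const struct range *req, size_t n,
                                 uint64_t probe)
{
    return mapped_at(req, n, probe, PMD_SIZE, PMD_MASK);
}

/* Coverage when every request is rounded out to a full gbpage. */
static bool mapped_with_gbpages(const struct range *req, size_t n,
                                uint64_t probe)
{
    return mapped_at(req, n, probe, PUD_SIZE, PUD_MASK);
}
```

For a single 4K request at 0x100000, an address at 0x200000 is covered
in the gbpage model but not in the 2M model, so anything there that was
never explicitly requested silently drops out of the identity map.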

It might be worth setting up early printk on some of these systems
and seeing if the failure is in the early boot of the new kernel (which
is using the kexec-supplied identity mapped pages) rather than in kexec
per se.

But that is just my guess at the moment.

Eric

>> I hope the root cause can be fixed instead of patching it over with a flag
>> to suppress the problem, but I don't know how regressions are handled here.
>
> That would be my preference as well.
>
> Thanks,
>
> --> Steve Wahl
