Re: [REGRESSION] kexec does firmware reboot in kernel v6.7.6

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Tue, 22 Oct 2024 19:51:38 +0100

On Thu, 2024-03-14 at 17:25 +0800, Dave Young wrote:
> On Thu, 14 Mar 2024 at 00:18, Steve Wahl <steve.wahl@xxxxxxx> wrote:
> > 
> > On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
> > > 
> > > Kexec happens on identity mapped page tables.
> > > 
> > > The files of interest are machine_kexec_64.c and relocate_kernel_64.S
> > > 
> > > I suspect either the building of the identity mapped page table in
> > > machine_kexec_prepare, or the switching to the page table in
> > > identity_mapped in relocate_kernel_64.S is where something goes wrong.
> > > 
> > > Probably in kernel_ident_mapping_init as that code is directly used
> > > to build the identity mapped page tables.
> > > 
> > > Hmm.
> > > 
> > > Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only
> > > where full GB page should be mapped.")
> > 
> > Yeah, sorry, I accidentally used the stable cherry-pick commit id that
> > Pavin Joseph found with his bisect results.
> > 
> > > Given the simplicity of that change itself my guess is that somewhere in
> > > the first 1Gb there are pages that needed to be mapped like the idt at 0
> > > that are not getting mapped.
> > 
> > ...
> > 
> > > It might be worth setting up early printk on some of these systems
> > > and seeing if the failure is in early boot up of the new kernel (that is
> > > using kexec supplied identity mapped pages) rather than in kexec per-se.
> > > 
> > > But that is just my guess at the moment.
> > 
> > Thanks for the input.  I was thinking in terms of running out of
> > memory somewhere because we're using more page table entries than we
> > used to.  But you've got me thinking that maybe some necessary region
> > is not explicitly requested to be placed in the identity map, but is
> > by luck included in the rounding errors when we use gbpages.
> 
> Yes, it is possible. Here is an example case:
> http://lists.infradead.org/pipermail/kexec/2023-June/027301.html
> Final change was to avoid doing AMD things on Intel platform, but the
> mapping code is still not fixed in a good way.

I spent all of Monday setting up a full GDT, IDT and exception handler
for the relocate_kernel() environment¹, and I think these reports may
have been the same as what I've been debugging.

We end up taking a #PF, usually on one of the 'rep mov's, one time on
the 'pushq %r8' right before using it to 'ret' to identity_mapped. In
each case it happens on the first *write* to a page.

Now that I can print %cr2 when it happens (instead of just going
straight to a triple fault), I spot an interesting fact about the
address: it's always *adjacent* to a region reserved by BIOS in the
e820 data, and within the same 2MiB page.

[    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable

2024-10-22 17:09:14.291000 kern NOTICE [   58.996257] kexec: Control page at c149431000
2024-10-22 17:09:14.291000 Y
2024-10-22 17:09:14.291000 rip:000000c1494312f8
2024-10-22 17:09:14.291000 rsp:000000c149431f90
2024-10-22 17:09:14.291000 Exc:000000000000000e
2024-10-22 17:09:14.291000 Err:0000000080000003
2024-10-22 17:09:14.291000 rax:000000c142130000
2024-10-22 17:09:14.291000 rbx:000000010d4b8020
2024-10-22 17:09:14.291000 rcx:0000000000000200
2024-10-22 17:09:14.291000 rdx:000000000009c000
2024-10-22 17:09:14.291000 rsi:000000000009c000
2024-10-22 17:09:14.291000 rdi:000000c142130000
2024-10-22 17:09:14.291000 r8 :000000c149431000
2024-10-22 17:09:14.291000 r9 :000000c149430000
2024-10-22 17:09:14.291000 r10:000000010d4bc000
2024-10-22 17:09:14.291000 r11:0000000000000000
2024-10-22 17:09:14.291000 r12:0000000000000000
2024-10-22 17:09:14.291000 r13:0000000000770ef0
2024-10-22 17:09:14.291000 r14:ffff8c82c0000000
2024-10-22 17:09:14.291000 r15:0000000000000000
2024-10-22 17:09:14.291000 cr2:000000c142130000

And bit 31 in the error code is set, which means it's an RMP
violation. 
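
To make the reading of that error code explicit, here is a minimal sketch that decodes the bits relevant to this dump. The macro and function names are local to the example (hypothetical, not taken from any kernel header); the bit positions are the architectural x86 page-fault error code bits, with bit 31 being the RMP bit on AMD SEV-SNP systems.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Page-fault error code bits (names are local to this sketch). */
#define PF_PROT  (1u << 0)   /* 1 = protection violation, 0 = page not present */
#define PF_WRITE (1u << 1)   /* 1 = write access, 0 = read access */
#define PF_USER  (1u << 2)   /* 1 = fault taken in user mode */
#define PF_RMP   (1u << 31)  /* 1 = RMP (SEV-SNP) violation */

static bool pf_is_protection(uint32_t err) { return (err & PF_PROT) != 0; }
static bool pf_is_write(uint32_t err)      { return (err & PF_WRITE) != 0; }
static bool pf_is_user(uint32_t err)       { return (err & PF_USER) != 0; }
static bool pf_is_rmp(uint32_t err)        { return (err & PF_RMP) != 0; }

/* Print a one-line summary of the bits decoded above. */
static void pf_describe(uint32_t err)
{
    printf("err=%#010x: %s, %s, %s%s\n", (unsigned int)err,
           pf_is_protection(err) ? "present" : "not-present",
           pf_is_write(err) ? "write" : "read",
           pf_is_user(err) ? "user" : "supervisor",
           pf_is_rmp(err) ? ", RMP violation" : "");
}
```

Calling `pf_describe(0x80000003u)` with the Err value from the dump reports a supervisor write to a present page with the RMP bit set, which matches the "first *write* to a page" pattern.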

Looks like we set up a 2MiB page covering the whole range from
0xc142000000 to 0xc142200000, but we aren't allowed to touch the first
half of that.
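
The arithmetic behind that can be checked with a small sketch: round the faulting address down to a 2MiB boundary and test whether the resulting 2MiB page overlaps the reserved e820 range. The helper names here are made up for illustration, and ranges are treated as inclusive, as in the e820 printout.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M_BYTES (2ULL << 20)   /* 2MiB */

/* Base address of the 2MiB page containing addr. */
static uint64_t page_2m_base(uint64_t addr)
{
    return addr & ~(SZ_2M_BYTES - 1);
}

/* True if inclusive ranges [a_start, a_end] and [b_start, b_end] overlap. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
                           uint64_t b_start, uint64_t b_end)
{
    return a_start <= b_end && b_start <= a_end;
}

/* True if the 2MiB page containing addr touches the reserved range. */
static bool page_2m_hits_reserved(uint64_t addr,
                                  uint64_t rsv_start, uint64_t rsv_end)
{
    uint64_t base = page_2m_base(addr);

    return ranges_overlap(base, base + SZ_2M_BYTES - 1, rsv_start, rsv_end);
}
```

For cr2 = 0xc142130000 the base is 0xc142000000, so the page spans the tail of the reserved range ending at 0xc1420fffff. The control page at 0xc149431000 sits in a 2MiB page starting at 0xc149400000, which lies entirely in usable RAM.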

For me it happens either with or without Steve's last patch, *but*
clearing direct_gbpages did seem to make it go away (or at least
reduced the incident rate far below the 1-crash-in-1000-kexecs which I
was seeing before).

I think Steve's original patch was just moving things around a little
and, because it allocated more pages for page tables, just happened to
leave pages in the offending range to be allocated for writing to, for
the unlucky victims.

I think the patch was actually along the right lines though, although
it needs to go all the way down to 4KiB PTEs in some cases. And it
could probably map anything that the e820 calls 'usable RAM', rather
than really restricting itself to precisely the ranges which it's
requested to map. 
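
As a rough illustration of that idea (a standalone sketch, not the kernel's kernel_ident_mapping_init() and not a proposed patch), the mapping granularity for an address could be chosen as the largest page size whose naturally aligned block lies entirely inside one usable range. The struct and function names are hypothetical, and the sketch assumes adjacent usable e820 entries have already been merged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SZ_4K (4ULL << 10)
#define MAP_SZ_2M (2ULL << 20)
#define MAP_SZ_1G (1ULL << 30)

/* One usable RAM range, inclusive at both ends (hypothetical type). */
struct usable_range {
    uint64_t start;
    uint64_t end;
};

/* True if [base, base + size - 1] lies entirely inside one usable range. */
static bool block_is_usable(const struct usable_range *r, size_t n,
                            uint64_t base, uint64_t size)
{
    uint64_t last = base + size - 1;

    for (size_t i = 0; i < n; i++) {
        if (base >= r[i].start && last <= r[i].end)
            return true;
    }
    return false;
}

/*
 * Pick the largest page size whose aligned block around addr is fully
 * usable, falling back to 4KiB when neither large size fits.
 */
static uint64_t pick_page_size(const struct usable_range *r, size_t n,
                               uint64_t addr)
{
    static const uint64_t sizes[] = { MAP_SZ_1G, MAP_SZ_2M };

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        uint64_t base = addr & ~(sizes[i] - 1);

        if (block_is_usable(r, n, base, sizes[i]))
            return sizes[i];
    }
    return MAP_SZ_4K;
}
```

With the usable range from the e820 lines above (0xc142100000 to 0xfc7fffffff), the faulting address 0xc142130000 falls back to 4KiB, the control page at 0xc149431000 gets a 2MiB page, and an address well inside the range such as 0xc200000000 gets a 1GiB page.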

¹ I'll post that exception handler at some point once I've tidied it
up.