On Thu, 2024-03-14 at 17:25 +0800, Dave Young wrote:
> On Thu, 14 Mar 2024 at 00:18, Steve Wahl <steve.wahl@xxxxxxx> wrote:
> >
> > On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
> > >
> > > Kexec happens on identity mapped page tables.
> > >
> > > The files of interest are machine_kexec_64.c and relocate_kernel_64.S
> > >
> > > I suspect either the building of the identity mappged page table in
> > > machine_kexec_prepare, or the switching to the page table in
> > > identity_mapped in relocate_kernel_64.S is where something goes wrong.
> > >
> > > Probably in kernel_ident_mapping_init as that code is directly used
> > > to build the identity mapped page tables.
> > >
> > > Hmm.
> > >
> > > Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only
> > > where full GB page should be mapped.")
> >
> > Yeah, sorry, I accidentally used the stable cherry-pick commit id that
> > Pavin Joseph found with his bisect results.
> >
> > > Given the simplicity of that change itself my guess is that somewhere in
> > > the first 1Gb there are pages that needed to be mapped like the idt at 0
> > > that are not getting mapped.
> >
> > ...
> >
> > > It might be worth setting up early printk on some of these systems
> > > and seeing if the failure is in early boot up of the new kernel (that is
> > > using kexec supplied identity mapped pages) rather than in kexec per-se.
> > >
> > > But that is just my guess at the moment.
> >
> > Thanks for the input. I was thinking in terms of running out of
> > memory somewhere because we're using more page table entries than we
> > used to. But you've got me thinking that maybe some necessary region
> > is not explicitly requested to be placed in the identity map, but is
> > by luck included in the rounding errors when we use gbpages.
>
> Yes, it is possible.
> Here is an example case:
> http://lists.infradead.org/pipermail/kexec/2023-June/027301.html
> Final change was to avoid doing AMD things on Intel platform, but the
> mapping code is still not fixed in a good way.

I spent all of Monday setting up a full GDT, IDT and exception handler
for the relocate_kernel() environment¹, and I think these reports may
have been the same as what I've been debugging.

We end up taking a #PF, usually on one of the 'rep mov's, one time on
the 'pushq %r8' right before using it to 'ret' to identity_mapped. In
each case it happens on the first *write* to a page.

Now that I can print %cr2 when it happens (instead of just going
straight to triple-fault), I spot an interesting fact about the address.
It's always *adjacent* to a region reserved by BIOS in the e820 data,
and within the same 2MiB page.

[ 0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable

2024-10-22 17:09:14.291000 kern NOTICE [ 58.996257] kexec: Control page at c149431000
2024-10-22 17:09:14.291000 Y
2024-10-22 17:09:14.291000 rip:000000c1494312f8
2024-10-22 17:09:14.291000 rsp:000000c149431f90
2024-10-22 17:09:14.291000 Exc:000000000000000e
2024-10-22 17:09:14.291000 Err:0000000080000003
2024-10-22 17:09:14.291000 rax:000000c142130000
2024-10-22 17:09:14.291000 rbx:000000010d4b8020
2024-10-22 17:09:14.291000 rcx:0000000000000200
2024-10-22 17:09:14.291000 rdx:000000000009c000
2024-10-22 17:09:14.291000 rsi:000000000009c000
2024-10-22 17:09:14.291000 rdi:000000c142130000
2024-10-22 17:09:14.291000 r8 :000000c149431000
2024-10-22 17:09:14.291000 r9 :000000c149430000
2024-10-22 17:09:14.291000 r10:000000010d4bc000
2024-10-22 17:09:14.291000 r11:0000000000000000
2024-10-22 17:09:14.291000 r12:0000000000000000
2024-10-22 17:09:14.291000 r13:0000000000770ef0
2024-10-22 17:09:14.291000 r14:ffff8c82c0000000
2024-10-22 17:09:14.291000 r15:0000000000000000
2024-10-22 17:09:14.291000 cr2:000000c142130000

And bit 31 in the error code is set, which means it's an RMP
violation.

Looks like we set up a 2MiB page covering the whole range from
0xc142000000 to 0xc142200000, but we aren't allowed to touch the first
half of that.

For me it happens either with or without Steve's last patch, *but*
clearing direct_gbpages did seem to make it go away (or at least
reduced the incident rate far below the 1-crash-in-1000-kexecs which I
was seeing before).

I think Steve's original patch was just moving things around a little
and, because it allocated more pages for page tables, just happened to
leave pages in the offending range to be allocated for writing to, for
the unlucky victims.

I think the patch was actually along the right lines though, although
it needs to go all the way down to 4KiB PTEs in some cases. And it
could probably map anything that the e820 calls 'usable RAM', rather
than really restricting itself to precisely the ranges which it's
requested to map.

¹ I'll post that exception handler at some point once I've tidied it
  up.
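
For anyone wanting to check the arithmetic against their own dumps,
here is a minimal standalone C sketch, not kernel code. It decodes the
#PF error code bits seen above and tests whether the 2MiB page
containing a faulting address also covers part of a reserved e820
region. The macro and function names are made up for illustration; only
the bit positions (present, write, RMP) and the 2MiB size come from the
discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; bit positions follow the x86 #PF error code. */
#define PF_PRESENT (UINT32_C(1) << 0)   /* fault on a present mapping */
#define PF_WRITE   (UINT32_C(1) << 1)   /* the access was a write */
#define PF_RMP     (UINT32_C(1) << 31)  /* RMP check failed */

#define LARGE_PAGE_SIZE (UINT64_C(1) << 21)  /* 2MiB */

/* Half-open physical address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* True if the error code describes a write to a present page that
 * failed the RMP check. */
static bool is_rmp_write_fault(uint32_t err)
{
	const uint32_t mask = PF_PRESENT | PF_WRITE | PF_RMP;

	return (err & mask) == mask;
}

/* The 2MiB-aligned range that a 2MiB mapping of addr would cover. */
static struct range large_page_of(uint64_t addr)
{
	struct range r;

	r.start = addr & ~(LARGE_PAGE_SIZE - 1);
	r.end = r.start + LARGE_PAGE_SIZE;
	return r;
}

/* True if two non-empty half-open ranges share at least one byte. */
static bool ranges_overlap(struct range a, struct range b)
{
	assert(a.start < a.end && b.start < b.end);
	return a.start < b.end && b.start < a.end;
}

/* True if a 2MiB mapping of addr would also map part of 'reserved'. */
static bool large_page_touches(uint64_t addr, struct range reserved)
{
	return ranges_overlap(large_page_of(addr), reserved);
}
```

With the values from the dump, is_rmp_write_fault(0x80000003) holds,
large_page_of(0xc142130000) is [0xc142000000, 0xc142200000), and that
range overlaps the reserved region ending at 0xc1420fffff even though
the faulting address itself lies in the usable region.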