On Mon, 2024-11-25 at 09:54 +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> The control_code_page should be explicitly mapped into the identity
> mapped page tables for the relocate_kernel environment. This only seems
> to have worked by luck before, because it tended to be within the same
> 2MiB or 1GiB large page already mapped for another reason.
>
> A subsequent commit will reduce the control_code_page to a single 4KiB
> page instead of a higher-order allocation, and seems to make it much
> *less* likely that we get lucky with its placement. This leads to a
> fault when relocate_kernel() first tries to access the page through its
> identity-mapped virtual address.

This one is confusing me. Jan points out that it shouldn't be needed,
because the control page should come from kernel memory and thus should
be mapped anyway because the loop immediately below my added code adds
*all* of the pfn_mapped[] ranges.

And from code inspection he appears to be right, but if I disable the
new mapping and add some printks...

--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -247,15 +247,18 @@ static int init_pgtable(struct kimage *image, unsigned long control_page)
 	info.direct_gbpages = true;
 
 	/* Ensure the control code page itself is in the direct map */
+	pr_info("No Map control page at %lx", control_page);
+#if 0
 	result = kernel_ident_mapping_init(&info, image->arch.pgd, control_page,
 					   control_page + KEXEC_CONTROL_CODE_MAX_SIZE);
 	if (result)
 		return result;
-
+#endif
 	for (i = 0; i < nr_pfn_mapped; i++) {
 		mstart = pfn_mapped[i].start << PAGE_SHIFT;
 		mend   = pfn_mapped[i].end << PAGE_SHIFT;
 
+		pr_info("Map pfn_mapped[%d] %lx - %lx\n", i, mstart, mend);
 		result = kernel_ident_mapping_init(&info, image->arch.pgd,
 						 mstart, mend);
 		if (result)

... and run in a version of qemu which dumps the CPU state on
triple-fault...

+ ./loadret
[    0.948097] kexec: No Map control page at 2b32000
[    0.948103] kexec: Map pfn_mapped[0] 0 - 7ffdd000
[    0.960192] Freezing user space processes
[    0.961685] Freezing user space processes completed (elapsed 0.001 seconds)
[    0.962372] OOM killer disabled.
[    1.088668] ata2: found unknown device (class 0)
[    1.095810] Disabling non-boot CPUs ...
[    1.117990] smpboot: CPU 1 is now offline
[    1.118595] crash hp: kexec_trylock() failed, kdump image may be inaccurate
RAX=0000000080050033 RBX=0000000000000000 RCX=0000000000000001 RDX=0000000000400000
RSI=0000000002b3205a RDI=0000000003a44002 RBP=ffff9709c2109400 RSP=0000000002b33000
R8 =0000000000000000 R9 =00000000038a0000 R10=0000000000000000 R11=0000000000000001
R12=0000000000000000 R13=0000000000170ef0 R14=00000000fee1dead R15=0000000000000000
RIP=ffff9709c2b32057 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 0000000000000000 00000000 00000000
TR =0040 fffffe2fb91b2000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     0000000000000000 00000000
IDT=     0000000000000000 00000000
CR0=80050033 CR2=0000000002b32ff8 CR3=00000000038a0000 CR4=00170ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=04 00 00 49 89 cb 48 8d a6 00 10 00 00 48 81 c6 5a 00 00 00 <56> c3 cc 6a 00 52 48 8d 05 8c 04 00 00 50 66 ff 30 0f 01 14 24 48 83 c4 0a 8c d8 8e d8 48

RIP xxx057 is here, where relocate_kernel first touches the 1:1 mapping
of the control page:

	/* setup a new stack at the end of the physical control page */
	lea	PAGE_SIZE(%rsi), %rsp
  49:	48 8d a6 00 10 00 00 	lea    0x1000(%rsi),%rsp

	/* jump to identity mapped page */
	addq	$(identity_mapped - relocate_kernel), %rsi
  50:	48 81 c6 5a 00 00 00 	add    $0x5a,%rsi
	pushq	%rsi
  57:	56                   	push   %rsi

The control page at 2b32xxx *really* ought to be mapped, as it's clearly
within the 0 - 7ffdd000 range. What's going on?