> On Sep 11, 2018, at 6:30 AM, Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>
> On 09/11/2018 04:52 AM, Andy Lutomirski wrote:
>>> On Sep 10, 2018, at 2:56 PM, Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>>>
>>> Hi folks,
>>>
>>> even after commit eeb89e2bb1ac ("x86/efi: Load fixmap GDT in
>>> efi_call_phys_epilog()"), my i386/efi qemu boot tests still crash randomly
>>> (roughly 5-10% of the time). As before, I don't see much useful output in
>>> the qemu log (this time it doesn't even complain about a triple fault).
>>>
>>> Debugging shows that the crash happens in efi_call_phys_epilog().
>>> A sample log from a crashed test run is attached below. It appears that
>>> the crash happens if there is an interrupt at a critical section of the
>>> code.
>>>
>>> While playing with the code, I found a possible fix.
>>>
>>> diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
>>> index 05ca14222463..9959657127f4 100644
>>> --- a/arch/x86/platform/efi/efi_32.c
>>> +++ b/arch/x86/platform/efi/efi_32.c
>>> @@ -85,10 +85,9 @@ pgd_t * __init efi_call_phys_prolog(void)
>>>
>>>  void __init efi_call_phys_epilog(pgd_t *save_pgd)
>>>  {
>>> +	load_fixmap_gdt(0);
>>>  	load_cr3(save_pgd);
>>>  	__flush_tlb_all();
>>> -
>>> -	load_fixmap_gdt(0);
>>>  }
>> We have IRQs on here? It seems plausible that we’re in a window where the EFI pgd doesn’t have cpu_entry_area mapped. Also, the hard coded CPU 0 is suspicious.
> The hard coded CPU 0 was always there. The call is ultimately from
> efi_enter_virtual_mode(), which is called from start_kernel().
> so presumably it is guaranteed to run on CPU 0.
>
>> Maybe try instrumenting the code to check whether the clone_pgd_range calls in setup_percpu.c have happened yet?
> The crash is seen late in the boot process, so I am quite sure it happened,
> but I can add a check if needed. I think that might be a different problem,
> though.
>
>> Your patch may well be correct, but, if we have IRQs on, we should really have cpu_entry_area mapped in both pgds.
>> Or we could turn off IRQs. Why on Earth are IRQs on in a context where the fixmap gdt is unusable?
>
> From arch/x86/platform/efi/efi.c:phys_efi_set_virtual_address_map():
>
>     save_pgd = efi_call_phys_prolog();
>     local_irq_save(flags);
>     status = efi_call_phys(...);
>     local_irq_restore(flags);
>
>     efi_call_phys_epilog(save_pgd);
>
> So, yes, interrupts are very much enabled.

Does fixing that solve the problem? It seems more robust.

>
> I ran several additional test sequences. With above patch, no failures with
> 500 boots. Without it, failure rate (long term average) across 500 boots
> is around 10%. Another data point: Moving load_fixmap_gdt(0); after
> load_cr3(save_pgd); does not help; it has to come first.
>
> Guenter
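
To make the "turn off IRQs" question concrete, here is a toy user-space model of the two orderings. It is not kernel code and has not been run against the real crash. The model_* names and the "exposed" counter are made up for illustration. The model only counts steps that run with interrupts enabled while the EFI page tables and GDT are switched in; it says nothing about whether that window is the actual cause of the failures.

```c
#include <stdbool.h>

/* Toy user-space model; none of this is kernel code. */
struct model {
	bool irqs_on;   /* stands in for the CPU interrupt flag */
	bool in_window; /* true while the EFI pgd/GDT are switched in */
	int exposed;    /* steps run with IRQs on while state was switched */
};

/* Record one step; an IRQ could arrive here if irqs_on is set. */
static void model_step(struct model *m)
{
	if (m->irqs_on && m->in_window)
		m->exposed++;
}

/* Stand-in for efi_call_phys_prolog(): state is switched from here on. */
static void model_prolog(struct model *m)
{
	m->in_window = true;
	model_step(m);
}

/* Stand-in for efi_call_phys(). */
static void model_efi_call(struct model *m)
{
	model_step(m);
}

/* Stand-in for efi_call_phys_epilog(): still switched while restoring. */
static void model_epilog(struct model *m)
{
	model_step(m);
	m->in_window = false;
}

/* Ordering as quoted from phys_efi_set_virtual_address_map(). */
int model_current_order(void)
{
	struct model m = { .irqs_on = true };

	model_prolog(&m);
	m.irqs_on = false;	/* local_irq_save() */
	model_efi_call(&m);
	m.irqs_on = true;	/* local_irq_restore() */
	model_epilog(&m);
	return m.exposed;
}

/* Hypothetical reordering: IRQs off across prolog, call and epilog. */
int model_irqs_off_order(void)
{
	struct model m = { .irqs_on = true };

	m.irqs_on = false;	/* local_irq_save() moved up */
	model_prolog(&m);
	model_efi_call(&m);
	model_epilog(&m);
	m.irqs_on = true;	/* local_irq_restore() moved down */
	return m.exposed;
}
```

In this model the quoted ordering leaves two exposed steps (the prolog and the epilog), and the reordered one leaves none. In the real code that would mean moving local_irq_save() above efi_call_phys_prolog() and local_irq_restore() below efi_call_phys_epilog(), assuming nothing in either helper needs interrupts enabled.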