[Bug 206215] QEMU guest crash due to random 'general protection fault' since kernel 5.2.5 on i7-3517UE

bugzilla-daemon@xxxxxxxxxxxxxxxxxxx · Wed, 15 Jan 2020 22:15:27 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=206215

--- Comment #2 from kernel@xxxxxxxxxx ---
(In reply to Sean Christopherson from comment #1)
> Created attachment 286833 [details]
> 0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch
> 
> +cc Derek, who is hitting the same thing.
> 
> On Wed, Jan 15, 2020 at 09:18:56PM +0000,
> bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=206215
> > 
> >             Bug ID: 206215
> >            Summary: QEMU guest crash due to random 'general protection
> >                     fault' since kernel 5.2.5 on i7-3517UE
> >            Product: Virtualization
> >            Version: unspecified
> >     Kernel Version: 5.5.0-0.rc6
> >           Hardware: x86-64
> >                 OS: Linux
> >               Tree: Fedora
> >             Status: NEW
> >           Severity: blocking
> >           Priority: P1
> >          Component: kvm
> >           Assignee: virtualization_kvm@xxxxxxxxxxxxxxxxxxxx
> >           Reporter: kernel@xxxxxxxxxx
> >         Regression: Yes
> > 
> > Created attachment 286831 [details]
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=286831&action=edit
> > relevant logs
> > 
> > Since kernel 5.2.5 any qemu guest fail to start due to "general protection
> > fault"
> > 
> > [  188.533545] traps: gsd-wacom[1855] general protection fault
> > ip:7fed39b5e7b0
> > sp:7fff3e349620 error:0 in libglib-2.0.so.0.6200.1[7fed39ae3000+83000]
> > [  192.002357] traps: gvfs-fuse-sub[1560] general protection fault
> > ip:7f9cd88100b2 sp:7f9cd5db0bf0 error:0 in
> > libglib-2.0.so.0.6200.1[7f9cd87de000+83000]
> > 
> > Please note that kernel 5.2.4 work fine.
> > 
> > Tested guests with Widows Server 2016/2019 & Fedora 31
> > 
> > Attached logs show the DMESG output of the guests
> > 
> > Attached host files contains a WARNING thrown upong first guest start on
> the
> > hypervisor:
> > 
> > [   49.533713] WARNING: CPU: 3 PID: 966 at arch/x86/kvm/x86.c:7963
> > kvm_arch_vcpu_ioctl_run+0x1927/0x1ce0 [kvm]
> 
> Between the WARN, which is
> 
>   WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));
> 
> and the total diff of arch/x86/kvm for 5.2.4 -> 5.2.5 is
> 
> --- 5.2.4/arch/x86/kvm/x86.c    2020-01-15 13:37:05.154445843 -0800
> +++ 5.2.5/arch/x86/kvm/x86.c    2020-01-15 13:37:08.190438719 -0800
> @@ -3264,6 +3264,10 @@
> 
>         kvm_x86_ops->vcpu_load(vcpu, cpu);
> 
> +       fpregs_assert_state_consistent();
> +       if (test_thread_flag(TIF_NEED_FPU_LOAD))
> +               switch_fpu_return();
> +
>         /* Apply any externally detected TSC adjustments (due to suspend) */
>         if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
>                 adjust_tsc_offset_host(vcpu,
> vcpu->arch.tsc_offset_adjustment);
> @@ -7955,9 +7959,8 @@
>                 wait_lapic_expire(vcpu);
>         guest_enter_irqoff();
> 
> -       fpregs_assert_state_consistent();
> -       if (test_thread_flag(TIF_NEED_FPU_LOAD))
> -               switch_fpu_return();
> +       /* The preempt notifier should have taken care of the FPU already. 
> */
> +       WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));
> 
>         if (unlikely(vcpu->arch.switch_db_regs)) {
>                 set_debugreg(0, 7); 
> 
> 
> that's a big smoking gun pointing at commit ca7e6b286333 ("KVM: X86: Fix
> fpu state crash in kvm guest"), which is commit e751732486eb upstream.
> 
> 1. Can you verify reverting ca7e6b286333 (or e751732486eb in upstream)
>    solves the issue?
> 
> 2. Assuming the answer is yes, on a buggy kernel, can you run with the
>    attached patch to try get debug info?
> 
> > [   49.533714] Modules linked in: vhost_net vhost tap tun xfrm4_tunnel
> > tunnel4
> > ipcomp xfrm_ipcomp esp4 ah4 af_key ebtable_filter ebtables ip6table_filter
> > ip6_tables bridge stp llc nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT
> > nf_reject_ipv4 iptable_filter iptable_security iptable_raw xt_state
> > xt_conntrack xt_DSCP xt_multiport iptable_mangle xt_TCPMSS xt_tcpmss
> > xt_policy
> > xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > intel_rapl
> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel sunrpc kvm vfat
> fat
> > mei_hdcp mei_wdt snd_hda_codec_hdmi iTCO_wdt irqbypass iTCO_vendor_support
> > snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul crc32_pclmul
> > ledtrig_audio snd_hda_intel ghash_clmulni_intel snd_hda_codec intel_cstate
> > intel_uncore snd_hda_core snd_hwdep intel_rapl_perf snd_seq snd_seq_device
> > snd_pcm i2c_i801 r8169 lpc_ich mei_me snd_timer snd mei e1000e soundcore
> > pcc_cpufreq tcp_bbr sch_fq ip_tables xfs i915 libcrc32c i2c_algo_bit
> > drm_kms_helper crc32c_intel drm
> > [   49.533760]  serio_raw video
> > [   49.533764] CPU: 3 PID: 966 Comm: CPU 0/KVM Not tainted
> > 5.2.5-200.fc30.x86_64 #1
> > [   49.533765] Hardware name: CompuLab 0000000-00000/Intense-PC, BIOS
> > IPC_2.2.400.5 X64 03/15/2018
> > [   49.533784] RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1927/0x1ce0 [kvm]
> > [   49.533786] Code: 4c 89 e7 e8 1b 0b ff ff 4c 89 e7 e8 d3 8c fe ff 41 83
> a4
> > 24 e8 36 00 00 fb e9 bd ed ff ff f0 41 80 4c 24 31 10 e9 a5 ee ff ff <0f>
> 0b
> > e9
> > 74 ed ff ff 49 8b 84 24 c8 02 00 00 a9 00 00 01 00 0f 84
> > [   49.533787] RSP: 0018:ffffbe4e423ffd30 EFLAGS: 00010002
> > [   49.533789] RAX: 0000000000004b00 RBX: 0000000000000000 RCX:
> > ffffa044ce958000
> > [   49.533790] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> > 0000000000000000
> > [   49.533791] RBP: ffffbe4e423ffdd8 R08: 0000000000000000 R09:
> > 00000000000003e8
> > [   49.533792] R10: 0000000000000000 R11: 0000000000000000 R12:
> > ffffa044d38f8000
> > [   49.533792] R13: 0000000000000000 R14: ffffbe4e41ccf7b8 R15:
> > 0000000000000000
> > [   49.533794] FS:  00007f117953f700(0000) GS:ffffa044ee2c0000(0000)
> > knlGS:0000000000000000
> > [   49.533795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   49.533796] CR2: 0000000000000000 CR3: 000000040e8c2003 CR4:
> > 00000000001626e0
> > [   49.533797] Call Trace:
> > [   49.533817]  kvm_vcpu_ioctl+0x215/0x5c0 [kvm]
> > [   49.533821]  ? __seccomp_filter+0x7b/0x640
> > [   49.533824]  ? __switch_to_asm+0x34/0x70
> > [   49.533826]  ? __switch_to_asm+0x34/0x70
> > [   49.533827]  ? apic_timer_interrupt+0xa/0x20
> > [   49.533831]  do_vfs_ioctl+0x405/0x660
> > [   49.533834]  ksys_ioctl+0x5e/0x90
> > [   49.533836]  __x64_sys_ioctl+0x16/0x20
> > [   49.533839]  do_syscall_64+0x5f/0x1a0
> > [   49.533842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [   49.533844] RIP: 0033:0x7f117d1fb34b
> > [   49.533845] Code: 0f 1e fa 48 8b 05 3d 9b 0c 00 64 c7 00 26 00 00 00 48
> c7
> > c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48>
> 3d
> > 01
> > f0 ff ff 73 01 c3 48 8b 0d 0d 9b 0c 00 f7 d8 64 89 01 48
> > [   49.533846] RSP: 002b:00007f117953e698 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000010
> > [   49.533848] RAX: ffffffffffffffda RBX: 0000564f2cb65ba0 RCX:
> > 00007f117d1fb34b
> > [   49.533849] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> > 0000000000000019
> > [   49.533850] RBP: 00007f1179f20000 R08: 0000564f2b7e5390 R09:
> > 000000000000ffff
> > [   49.533851] R10: 0000564f2ca7a710 R11: 0000000000000246 R12:
> > 0000000000000001
> > [   49.533852] R13: 00007f1179f21002 R14: 0000000000000000 R15:
> > 0000564f2bc66e80
> > [   49.533854] ---[ end trace a562473b18c9b742 ]---
> > 
> > /proc/cpuinfo
> > 
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 58
> > model name      : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
> > stepping        : 9
> > microcode       : 0x1f
> > cpu MHz         : 828.296
> > cache size      : 4096 KB
> > physical id     : 0
> > siblings        : 4
> > core id         : 0
> > cpu cores       : 2
> > apicid          : 0
> > initial apicid  : 0
> > fpu             : yes
> > fpu_exception   : yes
> > cpuid level     : 13
> > wp              : yes
> > flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > cmov
> > pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp
> > lm
> > constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
> cpuid
> > aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
> > xtpr
> > pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > rdrand lahf_lm cpuid_fault epb pti ibrs ibpb stibp tpr_shadow vnmi
> > flexpriority
> > ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
> > bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
> > mds
> > bogomips        : 4389.89
> > clflush size    : 64
> > cache_alignment : 64
> > address sizes   : 36 bits physical, 48 bits virtual
> > power management:
> > 
> > -- 
> > You are receiving this mail because:
> > You are watching the assignee of the bug.

Sean,
Thank you for the quick feedback.
In the zip file I did attach DMESG logs with latest vanilla kernel with same
behavior:
5.5.0-0.rc6.git1.1.vanilla.knurd.1.fc31.x86_64

If I'm rebuilding the kernel I'd rather spend time on the most recent one ?

Can you confirm if I apply the patch on kernel 5.5.0-rc6 ?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.