There seem to be two problems with the way arch.xen.shinfo_cache instance of gfn_to_pfn_cache is treated. 1. gpc->lock is taken without checking if it was actually initialized. e.g. kvm_xen_set_evtchn_fast(): read_lock_irqsave(&gpc->lock, flags); if (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, PAGE_SIZE)) goto out_rcu; INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. CPU: 2 PID: 959 Comm: xenirq Not tainted 6.0.0-rc5 #12 Call Trace: dump_stack_lvl+0x5b/0x77 register_lock_class+0x46d/0x480 __lock_acquire+0x64/0x1fa0 lock_acquire+0xbf/0x2b0 ? kvm_xen_set_evtchn_fast+0xc7/0x400 [kvm] ? lock_acquire+0xcf/0x2b0 ? _raw_read_lock_irqsave+0x99/0xa0 _raw_read_lock_irqsave+0x81/0xa0 ? kvm_xen_set_evtchn_fast+0xc7/0x400 [kvm] kvm_xen_set_evtchn_fast+0xc7/0x400 [kvm] ? kvm_xen_set_evtchn_fast+0x7e/0x400 [kvm] ? find_held_lock+0x2b/0x80 kvm_xen_hvm_evtchn_send+0x4b/0x90 [kvm] kvm_arch_vm_ioctl+0x4de/0xca0 [kvm] ? vmx_vcpu_put+0x18/0x1e0 [kvm_intel] ? kvm_arch_vcpu_put+0x1db/0x250 [kvm] ? vcpu_put+0x46/0x70 [kvm] ? kvm_arch_vcpu_ioctl+0xd0/0x1710 [kvm] kvm_vm_ioctl+0x4e4/0xdd0 [kvm] ? lock_is_held_type+0xe3/0x140 __x64_sys_ioctl+0x8d/0xd0 do_syscall_64+0x58/0x80 ? __do_fast_syscall_32+0xeb/0xf0 ? lockdep_hardirqs_on+0x7d/0x100 ? lock_is_held_type+0xe3/0x140 ? do_syscall_64+0x67/0x80 ? lockdep_hardirqs_on+0x7d/0x100 ? do_syscall_64+0x67/0x80 ? lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x63/0xcd 2. kvm_gfn_to_pfn_cache_init() allows for gpc->lock reinitialization. This can lead to situation where a lock that is already taken gets reinitialized in another thread (and becomes corrupted). For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and kvm_gfn_to_pfn_cache_init(): (thread 1) | (thread 2) | kvm_xen_set_evtchn_fast | read_lock_irqsave(&gpc->lock, ...) | | kvm_gfn_to_pfn_cache_init | rwlock_init(&gpc->lock) read_unlock_irqrestore(&gpc->lock, ...) | Testing shinfo lock corruption (KVM_XEN_HVM_EVTCHN_SEND) rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-...D } 26610 jiffies s: 265 root: 0x2/. rcu: blocking rcu_node structures (internal RCU debug): Task dump for CPU 1: task:xen_shinfo_test state:R running task stack: 0 pid: 952 ppid: 867 flags:0x00000008 Call Trace: ? exc_page_fault+0x121/0x2b0 ? rcu_read_lock_sched_held+0x10/0x80 ? entry_SYSCALL_64_after_hwframe+0x63/0xcd rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 1-...0: (5 ticks this GP) idle=6b94/1/0x4000000000000000 softirq=5929/5931 fqs=15261 (detected by 0, t=65002 jiffies, g=5465, q=100 ncpus=4) Sending NMI from CPU 0 to CPUs 1: NMI backtrace for cpu 1 CPU: 1 PID: 952 Comm: xen_shinfo_test Not tainted 6.0.0-rc5 #12 RIP: 0010:queued_write_lock_slowpath+0x68/0x90 Call Trace: do_raw_write_lock+0xad/0xb0 kvm_gfn_to_pfn_cache_refresh+0x2a5/0x630 [kvm] kvm_xen_hvm_set_attr+0x19d/0x5e0 [kvm] kvm_arch_vm_ioctl+0x8ca/0xca0 [kvm] ? rcu_read_lock_sched_held+0x10/0x80 kvm_vm_ioctl+0x4e4/0xdd0 [kvm] ? rcu_read_lock_sched_held+0x10/0x80 ? do_raw_write_trylock+0x29/0x40 ? rcu_read_lock_sched_held+0x10/0x80 ? lock_release+0x1ef/0x2d0 ? lock_release+0x1ef/0x2d0 __x64_sys_ioctl+0x8d/0xd0 do_syscall_64+0x58/0x80 ? exc_page_fault+0x121/0x2b0 ? rcu_read_lock_sched_held+0x10/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Testing shinfo lock corruption (KVM_XEN_HVM_EVTCHN_SEND) BUG: kernel NULL pointer dereference, address: 0000000000000800 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 1049cf067 P4D 1049cf067 PUD 104ba0067 PMD 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 955 Comm: xen_shinfo_test Not tainted 6.0.0-rc5 #12 RIP: 0010:kvm_xen_set_evtchn_fast+0x10f/0x400 [kvm] Call Trace: kvm_xen_hvm_evtchn_send+0x4b/0x90 [kvm] kvm_arch_vm_ioctl+0x4de/0xca0 [kvm] ? kvm_xen_hvm_evtchn_send+0x6e/0x90 [kvm] ? kvm_arch_vm_ioctl+0x4de/0xca0 [kvm] kvm_vm_ioctl+0x4e4/0xdd0 [kvm] ? kvm_vm_ioctl+0x4e4/0xdd0 [kvm] ? rcu_read_lock_sched_held+0x10/0x80 ? lock_release+0x1ef/0x2d0 __x64_sys_ioctl+0x8d/0xd0 do_syscall_64+0x58/0x80 ? rcu_read_lock_sched_held+0x10/0x80 ? trace_hardirqs_on_prepare+0x55/0xe0 ? do_syscall_64+0x67/0x80 ? rcu_read_lock_sched_held+0x10/0x80 ? trace_hardirqs_on_prepare+0x55/0xe0 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Similar story with the handling of hypercall SCHEDOP_poll in wait_pending_event(): Testing shinfo lock corruption (SCHEDOP_poll) rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 1-...0: (12 ticks this GP) idle=0a5c/1/0x4000000000000000 softirq=6640/6640 fqs=12988 rcu: 2-...0: (10 ticks this GP) idle=66e4/1/0x4000000000000000 softirq=5526/5527 fqs=12988 (detected by 0, t=65003 jiffies, g=7437, q=732 ncpus=4) Sending NMI from CPU 0 to CPUs 1: NMI backtrace for cpu 1 skipped: idling at native_halt+0xa/0x10 Sending NMI from CPU 0 to CPUs 2: NMI backtrace for cpu 2 CPU: 2 PID: 970 Comm: xen_shinfo_test Not tainted 6.0.0-rc5 #12 RIP: 0010:queued_write_lock_slowpath+0x66/0x90 Call Trace: do_raw_write_lock+0xad/0xb0 kvm_gfn_to_pfn_cache_refresh+0x8a/0x630 [kvm] ? kvm_gfn_to_pfn_cache_init+0x122/0x130 [kvm] kvm_xen_hvm_set_attr+0x19d/0x5e0 [kvm] kvm_arch_vm_ioctl+0x8ca/0xca0 [kvm] ? __lock_acquire+0x3a4/0x1fa0 ? __lock_acquire+0x3a4/0x1fa0 kvm_vm_ioctl+0x4e4/0xdd0 [kvm] ? lock_is_held_type+0xe3/0x140 ? lock_release+0x135/0x2d0 __x64_sys_ioctl+0x8d/0xd0 do_syscall_64+0x58/0x80 ? lockdep_hardirqs_on+0x7d/0x100 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x63/0xcd Testing shinfo lock corruption (SCHEDOP_poll) watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:2:260] irq event stamp: 6990 hardirqs last enabled at (6989): [<ffffffff81e5c964>] _raw_spin_unlock_irq+0x24/0x50 hardirqs last disabled at (6990): [<ffffffff81e53e51>] __schedule+0xd41/0x1620 softirqs last enabled at (5790): [<ffffffff81766e78>] rht_deferred_worker+0x708/0xbe0 softirqs last disabled at (5788): [<ffffffff81766967>] rht_deferred_worker+0x1f7/0xbe0 CPU: 1 PID: 260 Comm: kworker/1:2 Not tainted 6.0.0-rc5 #12 Workqueue: rcu_gp wait_rcu_exp_gp RIP: 0010:smp_call_function_single+0x11a/0x160 Call Trace: ? trace_hardirqs_on+0x2b/0xd0 __sync_rcu_exp_select_node_cpus+0x267/0x460 sync_rcu_exp_select_cpus+0x1ec/0x3e0 wait_rcu_exp_gp+0xf/0x20 process_one_work+0x254/0x560 worker_thread+0x4f/0x390 ? _raw_spin_unlock_irqrestore+0x40/0x60 ? process_one_work+0x560/0x560 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 I'm providing a set of patches: check if shinfo lock was initialized, disallow lock reinitialization. Along with crudely made testcases. Note: as I understand, kvm->lock mutex cannot be used to protect from those races because of kvm_xen_set_evtchn_fast() being called from kvm_arch_set_irq_inatomic()? I'm sending this as a RFC as I have doubts if explicitly disallowing reinitialization this way is the most elegant solution. Especially as the problem appears to affect only the shinfo gfn_to_pfn_cache. Michal Luczaj (4): KVM: x86/xen: Ensure kvm_xen_set_evtchn_fast() can use shinfo_cache KVM: x86/xen: Ensure kvm_xen_schedop_poll() can use shinfo_cache KVM: x86/xen: Disallow gpc locks reinitialization KVM: x86/xen: Test shinfo_cache lock races arch/x86/kvm/xen.c | 5 +- include/linux/kvm_types.h | 1 + .../selftests/kvm/x86_64/xen_shinfo_test.c | 100 ++++++++++++++++++ virt/kvm/pfncache.c | 7 +- 4 files changed, 110 insertions(+), 3 deletions(-) -- 2.37.2