On Thu, Apr 25, 2024, Sean Christopherson wrote:
> Register KVM's cpuhp and syscore callbacks when enabling virtualization
> in hardware instead of registering the callbacks during initialization,
> and let the CPU up/down framework invoke the inner enable/disable
> functions.  Registering the callbacks during initialization makes things
> more complex than they need to be, as KVM needs to be very careful about
> handling races between CPUs being onlined/offlined and hardware being
> enabled/disabled.
>
> Intel TDX support will require KVM to enable virtualization during KVM
> initialization, i.e. will add another wrinkle to things, at which point
> sorting out the potential races with kvm_usage_count would become even
> more complex.

> +static int hardware_enable_all(void)
> +{
> +	int r;
> +
> +	guard(mutex)(&kvm_lock);
> +
> +	if (kvm_usage_count++)
> +		return 0;
> +
> +	r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> +			      kvm_online_cpu, kvm_offline_cpu);
> +	if (r)
> +		return r;

There's a lock ordering issue here: KVM currently takes kvm_lock inside
cpu_hotplug_lock, but this code does the opposite.

I need to take a closer look at the locking, as I'm not entirely certain
that the existing ordering is correct or ideal.  E.g. cpu_hotplug_lock is
taken when updating static keys, static calls, etc., which makes taking
cpu_hotplug_lock outside kvm_lock dicey, as flows that take kvm_lock then
need to be very careful to never trigger seemingly innocuous updates.  A
rough sketch contrasting the two orderings is appended below, after the
splat.

And this lockdep splat, which I've now hit twice with the current
implementation, suggests that cpu_hotplug_lock => kvm_lock is already
unsafe/broken (I need to re-decipher the splat; I _think_ I mostly figured
it out last week, but then forgot over the weekend).

[48419.244819] ======================================================
[48419.250999] WARNING: possible circular locking dependency detected
[48419.257179] 6.9.0-smp--04b1c6b4841d-next #301 Tainted: G S O
[48419.264050] ------------------------------------------------------
[48419.270229] tee/27085 is trying to acquire lock:
[48419.274849] ffffb5fdd4e430a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]
[48419.284085] but task is already holding lock:
[48419.289918] ffffffffc06ccba8 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]
[48419.298386] which lock already depends on the new lock.
[48419.306559] the existing dependency chain (in reverse order) is:
[48419.314040] -> #3 (kvm_lock){+.+.}-{3:3}:
[48419.319523]        __mutex_lock+0x6a/0xb40
[48419.323622]        mutex_lock_nested+0x1f/0x30
[48419.328071]        kvm_dev_ioctl+0x4fb/0xe50 [kvm]
[48419.332898]        __se_sys_ioctl+0x7b/0xd0
[48419.337082]        __x64_sys_ioctl+0x21/0x30
[48419.341357]        do_syscall_64+0x8b/0x170
[48419.345540]        entry_SYSCALL_64_after_hwframe+0x71/0x79
[48419.351115] -> #2 (cpu_hotplug_lock){++++}-{0:0}:
[48419.357294]        cpus_read_lock+0x2e/0xb0
[48419.361480]        static_key_slow_inc+0x16/0x30
[48419.366098]        kvm_lapic_set_base+0x6a/0x1c0 [kvm]
[48419.371298]        kvm_set_apic_base+0x8f/0xe0 [kvm]
[48419.376298]        kvm_set_msr_common+0xa29/0x1080 [kvm]
[48419.381645]        vmx_set_msr+0xa54/0xbe0 [kvm_intel]
[48419.386795]        __kvm_set_msr+0xb6/0x1a0 [kvm]
[48419.391535]        kvm_arch_vcpu_ioctl+0xf66/0x1150 [kvm]
[48419.396970]        kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
[48419.401881]        __se_sys_ioctl+0x7b/0xd0
[48419.406067]        __x64_sys_ioctl+0x21/0x30
[48419.410342]        do_syscall_64+0x8b/0x170
[48419.414528]        entry_SYSCALL_64_after_hwframe+0x71/0x79
[48419.420099] -> #1 (&kvm->srcu){.?.+}-{0:0}:
[48419.425758]        __synchronize_srcu+0x44/0x1a0
[48419.430378]        synchronize_srcu_expedited+0x21/0x30
[48419.435603]        kvm_swap_active_memslots+0x113/0x1c0 [kvm]
[48419.441385]        kvm_set_memslot+0x360/0x620 [kvm]
[48419.446387]        __kvm_set_memory_region+0x27b/0x300 [kvm]
[48419.452078]        kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
[48419.458207]        kvm_vm_ioctl+0x295/0x650 [kvm]
[48419.462945]        __se_sys_ioctl+0x7b/0xd0
[48419.467133]        __x64_sys_ioctl+0x21/0x30
[48419.471406]        do_syscall_64+0x8b/0x170
[48419.475590]        entry_SYSCALL_64_after_hwframe+0x71/0x79
[48419.481164] -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
[48419.487343]        __lock_acquire+0x15ef/0x2e40
[48419.491876]        lock_acquire+0xe0/0x260
[48419.495977]        __mutex_lock+0x6a/0xb40
[48419.500076]        mutex_lock_nested+0x1f/0x30
[48419.504521]        set_nx_huge_pages+0x179/0x1e0 [kvm]
[48419.509694]        param_attr_store+0x93/0x100
[48419.514142]        module_attr_store+0x22/0x40
[48419.518587]        sysfs_kf_write+0x81/0xb0
[48419.522774]        kernfs_fop_write_iter+0x133/0x1d0
[48419.527738]        vfs_write+0x317/0x380
[48419.531663]        ksys_write+0x70/0xe0
[48419.535505]        __x64_sys_write+0x1f/0x30
[48419.539777]        do_syscall_64+0x8b/0x170
[48419.543961]        entry_SYSCALL_64_after_hwframe+0x71/0x79
[48419.549534] other info that might help us debug this:
[48419.557534] Chain exists of:
                 &kvm->slots_lock --> cpu_hotplug_lock --> kvm_lock
[48419.567873]  Possible unsafe locking scenario:
[48419.573793]        CPU0                    CPU1
[48419.578325]        ----                    ----
[48419.582859]   lock(kvm_lock);
[48419.585831]                                lock(cpu_hotplug_lock);
[48419.592011]                                lock(kvm_lock);
[48419.597499]   lock(&kvm->slots_lock);
[48419.601173]  *** DEADLOCK ***
[48419.607090] 5 locks held by tee/27085:
[48419.610841]  #0: ffffa0dcc92eb410 (sb_writers#4){.+.+}-{0:0}, at: vfs_write+0xe4/0x380
[48419.618765]  #1: ffffa0dce221ac88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xd8/0x1d0
[48419.627553]  #2: ffffa11c4d6bef28 (kn->active#257){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xe0/0x1d0
[48419.636684]  #3: ffffffffc06e3c50 (&mod->param_lock){+.+.}-{3:3}, at: param_attr_store+0x57/0x100
[48419.645575]  #4: ffffffffc06ccba8 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]
[48419.654564] stack backtrace:
[48419.658925] CPU: 93 PID: 27085 Comm: tee Tainted: G S O 6.9.0-smp--04b1c6b4841d-next #301
[48419.668312] Hardware name: Google Interlaken/interlaken, BIOS 0.20231025.0-0 10/25/2023
[48419.676314] Call Trace:
[48419.678770]  <TASK>
[48419.680878]  dump_stack_lvl+0x83/0xc0
[48419.684552]  dump_stack+0x14/0x20
[48419.687880]  print_circular_bug+0x2f0/0x300
[48419.692068]  check_noncircular+0x103/0x120
[48419.696163]  ? __lock_acquire+0x5e3/0x2e40
[48419.700266]  __lock_acquire+0x15ef/0x2e40
[48419.704286]  ? __lock_acquire+0x5e3/0x2e40
[48419.708387]  ? __lock_acquire+0x5e3/0x2e40
[48419.712486]  ? __lock_acquire+0x5e3/0x2e40
[48419.716586]  lock_acquire+0xe0/0x260
[48419.720164]  ? set_nx_huge_pages+0x179/0x1e0 [kvm]
[48419.725060]  ? lock_acquire+0xe0/0x260
[48419.728822]  ? param_attr_store+0x57/0x100
[48419.732921]  ? set_nx_huge_pages+0x179/0x1e0 [kvm]
[48419.737768]  __mutex_lock+0x6a/0xb40
[48419.741359]  ? set_nx_huge_pages+0x179/0x1e0 [kvm]
[48419.746207]  ? set_nx_huge_pages+0x14a/0x1e0 [kvm]
[48419.751054]  ? param_attr_store+0x57/0x100
[48419.755158]  ? __mutex_lock+0x240/0xb40
[48419.759005]  ? param_attr_store+0x57/0x100
[48419.763107]  mutex_lock_nested+0x1f/0x30
[48419.767031]  set_nx_huge_pages+0x179/0x1e0 [kvm]
[48419.771705]  param_attr_store+0x93/0x100
[48419.775629]  module_attr_store+0x22/0x40
[48419.779556]  sysfs_kf_write+0x81/0xb0
[48419.783222]  kernfs_fop_write_iter+0x133/0x1d0
[48419.787668]  vfs_write+0x317/0x380
[48419.791084]  ksys_write+0x70/0xe0
[48419.794401]  __x64_sys_write+0x1f/0x30
[48419.798152]  do_syscall_64+0x8b/0x170
[48419.801819]  entry_SYSCALL_64_after_hwframe+0x71/0x79
[48419.806872] RIP: 0033:0x7f98eff1ee0d
[48419.810465] Code: e5 41 57 41 56 53 50 49 89 d6 49 89 f7 89 fb 48 8b 05 37 7c 07 00 83 38 00 75 28 b8 01 00 00 00 89 df 4c 89 fe 4c 89 f2 0f 05 <48> 89 c3 48 3d 01 f0 ff ff 73 3a 48 89 d8 48 83 c4 08 5b 41 5e 41
[48419.829213] RSP: 002b:00007ffd5dee15c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[48419.836792] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f98eff1ee0d
[48419.843931] RDX: 0000000000000002 RSI: 00007ffd5dee16d0 RDI: 0000000000000005
[48419.851065] RBP: 00007ffd5dee15e0 R08: 00007f98efe161d2 R09: 0000000000000001
[48419.858196] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
[48419.865328] R13: 00007ffd5dee16d0 R14: 0000000000000002 R15: 00007ffd5dee16d0
[48419.872472]  </TASK>
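
For reference, here's the rough sketch of the two orderings mentioned above.
This is a minimal illustration under stated assumptions, not the actual
virt/kvm/kvm_main.c code: the enable/disable bodies and error handling are
stubbed out, and kvm_lock/kvm_usage_count are declared locally just so the
snippet is self-contained.  In the existing code, hardware_enable_all() takes
cpus_read_lock() before kvm_lock, and kvm_online_cpu() is invoked by the
cpuhp core with cpu_hotplug_lock already held, i.e. the established order is
cpu_hotplug_lock => kvm_lock.  The proposed hardware_enable_all() quoted
above takes kvm_lock first and then calls cpuhp_setup_state(), which takes
cpus_read_lock() internally, i.e. kvm_lock => cpu_hotplug_lock.

  /*
   * Illustrative sketch only; bodies stubbed, no error handling.
   */
  #include <linux/cpu.h>
  #include <linux/cpuhotplug.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(kvm_lock);
  static int kvm_usage_count;

  /* Invoked by the cpuhp core with cpu_hotplug_lock already held. */
  static int kvm_online_cpu(unsigned int cpu)
  {
  	/* Taking kvm_lock here records cpu_hotplug_lock => kvm_lock. */
  	guard(mutex)(&kvm_lock);
  	/* ... enable virtualization on this CPU if kvm_usage_count ... */
  	return 0;
  }

  static int kvm_offline_cpu(unsigned int cpu)
  {
  	guard(mutex)(&kvm_lock);
  	/* ... disable virtualization on this CPU if kvm_usage_count ... */
  	return 0;
  }

  /* Existing ordering: kvm_lock is taken inside cpu_hotplug_lock. */
  static int hardware_enable_all_existing(void)
  {
  	cpus_read_lock();		/* cpu_hotplug_lock */
  	mutex_lock(&kvm_lock);		/* kvm_lock, inside cpu_hotplug_lock */

  	if (!kvm_usage_count++) {
  		/* ... enable virtualization on all online CPUs ... */
  	}

  	mutex_unlock(&kvm_lock);
  	cpus_read_unlock();
  	return 0;
  }

  /* Proposed ordering: cpu_hotplug_lock is taken inside kvm_lock. */
  static int hardware_enable_all_proposed(void)
  {
  	guard(mutex)(&kvm_lock);	/* kvm_lock */

  	if (kvm_usage_count++)
  		return 0;

  	/* cpuhp_setup_state() takes cpus_read_lock() internally. */
  	return cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
  				 kvm_online_cpu, kvm_offline_cpu);
  }

The sketch only illustrates lock acquisition order; it deliberately says
nothing about which ordering is the right one to keep.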