On 2/6/24 12:58, Sean Christopherson wrote: > On Tue, Feb 06, 2024, Dongli Zhang wrote: >> Hi Prasad, >> >> On 1/2/24 23:53, Prasad Pandit wrote: >>> From: Prasad Pandit <pjp@xxxxxxxxxxxxxxxxx> >>> >>> kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI' >>> request for a vcpu even when its 'events->nmi.pending' is zero. >>> Ex: >>> qemu_thread_start >>> kvm_vcpu_thread_fn >>> qemu_wait_io_event >>> qemu_wait_io_event_common >>> process_queued_cpu_work >>> do_kvm_cpu_synchronize_post_init/_reset >>> kvm_arch_put_registers >>> kvm_put_vcpu_events (cpu, level=[2|3]) >>> >>> This leads vCPU threads in QEMU to constantly acquire & release the >>> global mutex lock, delaying the guest boot due to lock contention. >> >> Would you mind sharing where and how the lock contention is at QEMU space? That >> is, how the QEMU mutex lock is impacted by KVM KVM_REQ_NMI? >> >> Or you meant line 3031 at QEMU side? > > Yeah, something like that. Details in this thread. > > https://urldefense.com/v3/__https://lore.kernel.org/all/CAE8KmOyffXD4k69vRAFwesaqrBGzFY3i*kefbkHcQf4=jNYzOA@mail.gmail.com__;Kw!!ACWV5N9M2RV99hQ!N61g2QXuC5B5RpVNBowgKUXjHzX4vp0lCXuton3fKVRbzBuXaVtJgePu0ddSf3EB9EEQORTmwop4vD5KrQ$ Thank you very much for pointing to the discussion. I should have found them :) Here is my understanding. 1. During the VM creation, the mp_state of AP (non-BSP) is always KVM_MP_STATE_UNINITIALIZED, until INIT/SIPI. 2. Ideally, AP should block at below. That is, line 3775 is always false. 3760 bool kvm_vcpu_block(struct kvm_vcpu *vcpu) 3761 { 3762 struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu); 3763 bool waited = false; 3764 3765 vcpu->stat.generic.blocking = 1; 3766 3767 preempt_disable(); 3768 kvm_arch_vcpu_blocking(vcpu); 3769 prepare_to_rcuwait(wait); 3770 preempt_enable(); 3771 3772 for (;;) { 3773 set_current_state(TASK_INTERRUPTIBLE); 3774 3775 if (kvm_vcpu_check_block(vcpu) < 0) 3776 break; 3777 3778 waited = true; 3779 schedule(); 3780 } 3. Unfortunately, the issue may set KVM_REQ_NMI for AP. 4. This leads to the kvm_vcpu_check_block() to return. kvm_arch_vcpu_ioctl_run() -> kvm_vcpu_block() -> kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() -> kvm_vcpu_has_events() -> kvm_test_request(KVM_REQ_NMI, vcpu) 5. The kvm_arch_vcpu_ioctl_run() returns to QEMU with -EAGAIN. 6. The QEMU side is not able to handle -EAGAIN, but to goto line 2984 to return. It acquires the global mutex at line 2976 (release before entering into guest again). The KVM_REQ_NMI is never cleared until INIT/SIPI. 2808 int kvm_cpu_exec(CPUState *cpu) 2809 { ... ... 2868 if (run_ret < 0) { 2869 if (run_ret == -EINTR || run_ret == -EAGAIN) { 2870 trace_kvm_io_window_exit(); 2871 kvm_eat_signals(cpu); 2872 ret = EXCP_INTERRUPT; 2873 break; 2874 } ... ... 2973 } while (ret == 0); 2974 2975 cpu_exec_end(cpu); 2976 bql_lock(); 2977 2978 if (ret < 0) { 2979 cpu_dump_state(cpu, stderr, CPU_DUMP_CODE); 2980 vm_stop(RUN_STATE_INTERNAL_ERROR); 2981 } 2982 2983 qatomic_set(&cpu->exit_request, 0); 2984 return ret; 2985 } 7. The QEMU AP vCPU thread enters into KVM_RUN again. Same flow as step 4, goto step 4, again and again. The lock has been frequently acquired/released. The vCPU 0 is unhappy with it, especially when the number of APs is large! I guess it is not an issue after VM reboot (without QEMU instance re-creation because the mpstate is not KVM_MP_STATE_UNINITIALIZED any longer). Thank you very much! Dongli Zhang > >> 2858 int kvm_cpu_exec(CPUState *cpu) >> 2859 { >> 2860 struct kvm_run *run = cpu->kvm_run; >> 2861 int ret, run_ret; >> ... ... >> 3023 default: >> 3024 DPRINTF("kvm_arch_handle_exit\n"); >> 3025 ret = kvm_arch_handle_exit(cpu, run); >> 3026 break; >> 3027 } >> 3028 } while (ret == 0); >> 3029 >> 3030 cpu_exec_end(cpu); >> 3031 qemu_mutex_lock_iothread(); >> 3032 >> 3033 if (ret < 0) { >> 3034 cpu_dump_state(cpu, stderr, CPU_DUMP_CODE); >> 3035 vm_stop(RUN_STATE_INTERNAL_ERROR); >> 3036 } >> 3037 >> 3038 qatomic_set(&cpu->exit_request, 0); >> 3039 return ret; >> 3040 }