Re: Q. about KVM and CPU hotplug

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Tue, 30 Nov 2021 15:05:50 +0100

On Tue, Nov 30 2021 at 10:28, Paolo Bonzini wrote:

> On 11/30/21 09:27, Tian, Kevin wrote:
>> 		r = kvm_arch_hardware_enable();
>> 
>> 		if (r) {
>> 			cpumask_clear_cpu(cpu, cpus_hardware_enabled);
>> 			atomic_inc(&hardware_enable_failed);
>> 			pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
>> 		}
>> 	}
>> 
>> Upon error hardware_enable_failed is incremented. However this variable
>> is checked only in hardware_enable_all() called when the 1st VM is created.
>> 
>> This implies that KVM may be left in a state where it doesn't know that a
>> CPU is not ready to host VMX operations.
>> 
>> Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
>> KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
>> to Qemu at some point or may it lead to undefined behavior? And is there
>> any method to prevent vCPU thread from being scheduled to the CPU?
>
> It should fail the first vmptrld instruction.  It will result in a few 
> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed).  For VMX this 
> should be a pretty bad firmware bug, and it has never been reported. 
> KVM did find some undocumented errata but not this one!
>
> I don't think there's any fix other than pinning userspace.  The WARNs 
> can be eliminated by calling KVM_BUG_ON in the sched_in notifier, plus 
> checking if the VM is bugged before entering the guest or doing a 
> VMREAD/VMWRITE (usually the check is done only in an ioctl).  But some
> refactoring is probably needed to make the code more robust in general.

Why is this hotplug callback in the CPU starting section to begin with?

If you stick it into the online section which runs on the hotplugged CPU
in thread context:

	CPUHP_AP_ONLINE_IDLE,

-->   	CPUHP_AP_KVM_STARTING,

	CPUHP_AP_SCHED_WAIT_EMPTY,

then it is allowed to fail and it still works in the right way.

When onlining a CPU, no vCPU task can be running on that CPU at that
point.

When offlining a CPU, it's guaranteed that all user tasks and
non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU
task around either.
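
To illustrate why a failure in the online section is harmless, here is a
rough userspace model, not kernel code. Every name in it (hp_step,
cpu_up_model, cpu_down_model, fail_virt_enable, the two flag arrays) is
made up for the sketch. It assumes only what is said above: the KVM step
runs before the scheduler lets user tasks onto the CPU, and a failing step
causes the earlier steps to be undone in reverse order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one hotplug step; not the kernel's data structure. */
struct hp_step {
	const char *name;
	int (*startup)(unsigned int cpu);	/* allowed to fail */
	void (*teardown)(unsigned int cpu);	/* undoes startup */
};

#define NR_CPUS_MODEL 4

static bool virt_enabled[NR_CPUS_MODEL];	/* "VMX is on" for this CPU */
static bool sched_active[NR_CPUS_MODEL];	/* user tasks may run here */
static bool fail_virt_enable;			/* knob to simulate a failure */

static int kvm_online(unsigned int cpu)
{
	if (fail_virt_enable)
		return -5;	/* stands in for an errno such as -EIO */
	virt_enabled[cpu] = true;
	return 0;
}

static void kvm_offline(unsigned int cpu)
{
	virt_enabled[cpu] = false;
}

static int sched_activate(unsigned int cpu)
{
	sched_active[cpu] = true;
	return 0;
}

static void sched_deactivate(unsigned int cpu)
{
	sched_active[cpu] = false;
}

/* Order matters: virtualization is enabled before user tasks can arrive. */
static const struct hp_step online_steps[] = {
	{ "kvm:online",   kvm_online,     kvm_offline },
	{ "sched:active", sched_activate, sched_deactivate },
};

#define NR_STEPS (sizeof(online_steps) / sizeof(online_steps[0]))

/* Bring a CPU up; on failure roll back completed steps in reverse order. */
int cpu_up_model(unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);

	for (size_t i = 0; i < NR_STEPS; i++) {
		int ret = online_steps[i].startup(cpu);

		if (ret) {
			while (i-- > 0)
				online_steps[i].teardown(cpu);
			return ret;
		}
	}
	return 0;
}

/* Take a CPU down: tasks leave first, then virtualization is disabled. */
void cpu_down_model(unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);

	for (size_t i = NR_STEPS; i-- > 0; )
		online_steps[i].teardown(cpu);
}
```

With fail_virt_enable set, cpu_up_model() returns an error and
sched_active[cpu] stays false, so in this model no task can ever land on a
CPU whose virtualization enable failed.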

Thanks,

        tglx