Re: [PATCH v2] KVM: PPC: Book3S HV: XIVE: Prevent races when releasing device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/29/19 3:24 AM, Paul Mackerras wrote:
> Now that we have the possibility of a XIVE or XICS-on-XIVE device being
> released while the VM is still running, we need to be careful about
> races and potential use-after-free bugs.  Although the kvmppc_xive
> struct is not freed, but kept around for re-use, the kvmppc_xive_vcpu
> structs are freed, and they are used extensively in both the XIVE native
> and XICS-on-XIVE code.
> 
> There are various ways in which XIVE code gets invoked:
> 
> - VCPU entry and exit, which do push and pull operations on the XIVE hardware
> - one_reg get and set functions (vcpu->mutex is held)
> - XICS hypercalls (but only inside guest execution, not from
>   kvmppc_pseries_do_hcall)
> - device creation calls (kvm->lock is held)
> - device callbacks - get/set attribute, mmap, pagefault, release/destroy
> - set_mapped/clr_mapped calls (kvm->lock is held)
> - connect_vcpu calls
> - debugfs file read callbacks
> 
> Inside a device release function, we know that userspace cannot have an
> open file descriptor referring to the device, nor can it have any mmapped
> regions from the device.  Therefore the device callbacks are excluded,
> as are the connect_vcpu calls (since they need a fd for the device).
> Further, since the caller holds the kvm->lock mutex, no other device
> creation calls or set/clr_mapped calls can be executing concurrently.
> 
> To exclude VCPU execution and XICS hypercalls, we temporarily set
> kvm->arch.mmu_ready to 0.  This forces any VCPU task that is trying to
> enter the guest to take the kvm->lock mutex, which is held by the caller
> of the release function.  Then, sending an IPI to all other CPUs forces
> any VCPU currently executing in the guest to exit.

For my understanding, this method is faster than looping on all vCPUs
and calling kvm_vcpu_kick() ? 

> 
> Finally, we take the vcpu->mutex for each VCPU around the process of
> cleaning up and freeing its XIVE data structures, in order to exclude
> any one_reg get/set calls.
> 
> To exclude the debugfs read callbacks, we just need to ensure that
> debugfs_remove is called before freeing any data structures.  Once it
> returns we know that no CPU can be executing the callbacks (for our
> kvmppc_xive instance).
> 
> Signed-off-by: Paul Mackerras <paulus@xxxxxxxxxx>

LGTM, 

Reviewed-by: Cédric Le Goater <clg@xxxxxxxx>

one minor comment below,

Thanks,

C.

> ---
> v2: move debugfs_remove call to eliminate race
> 
>  arch/powerpc/kvm/book3s_xive.c        | 51 ++++++++++++++++++++++++++++-------
>  arch/powerpc/kvm/book3s_xive_native.c | 43 ++++++++++++++++++++++++-----
>  2 files changed, 78 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 922689b..4280cd8 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -846,7 +846,8 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
>  
>  	/*
>  	 * We can't update the state of a "pushed" VCPU, but that
> -	 * shouldn't happen.
> +	 * shouldn't happen because the vcpu->mutex makes running a
> +	 * vcpu mutually exclusive with doing one_reg get/set on it.
>  	 */
>  	if (WARN_ON(vcpu->arch.xive_pushed))
>  		return -EIO;
> @@ -1835,7 +1836,7 @@ void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
>  }
>  
>  /*
> - * Called when device fd is closed
> + * Called when device fd is closed.  kvm->lock is held.
>   */
>  static void kvmppc_xive_release(struct kvm_device *dev)
>  {
> @@ -1843,21 +1844,46 @@ static void kvmppc_xive_release(struct kvm_device *dev)
>  	struct kvm *kvm = xive->kvm;
>  	struct kvm_vcpu *vcpu;
>  	int i;
> +	int was_ready;
>  
>  	pr_devel("Releasing xive device\n");
>  
> +	debugfs_remove(xive->dentry);
> +
>  	/*
> -	 * When releasing the KVM device fd, the vCPUs can still be
> -	 * running and we should clean up the vCPU interrupt
> -	 * presenters first.
> +	 * Clearing mmu_ready temporarily while holding kvm->lock
> +	 * is a way of ensuring that no vcpus can enter the guest
> +	 * until we drop kvm->lock.  Doing kick_all_cpus_sync()
> +	 * ensures that any vcpu executing inside the guest has
> +	 * exited the guest.  Once kick_all_cpus_sync() has finished,
> +	 * we know that no vcpu can be executing the XIVE push or
> +	 * pull code, or executing a XICS hcall.
> +	 *
> +	 * Since this is the device release function, we know that
> +	 * userspace does not have any open fd referring to the
> +	 * device.  Therefore there can not be any of the device
> +	 * attribute set/get functions being executed concurrently,
> +	 * and similarly, the connect_vcpu and set/clr_mapped
> +	 * functions also cannot be being executed.
>  	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> -		kvmppc_xive_cleanup_vcpu(vcpu);
> +	was_ready = kvm->arch.mmu_ready;
> +	kvm->arch.mmu_ready = 0;
> +	kick_all_cpus_sync();
>  
> -	debugfs_remove(xive->dentry);
> +	/*
> +	 * We should clean up the vCPU interrupt presenters first.
> +	 */
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		/*
> +		 * Take vcpu->mutex to ensure that no one_reg get/set ioctl
> +		 * (i.e. kvmppc_xive_[gs]et_icp) can be done concurrently.
> +		 */
> +		mutex_lock(&vcpu->mutex);
> +		kvmppc_xive_cleanup_vcpu(vcpu);
> +		mutex_unlock(&vcpu->mutex);
> +	}
>  
> -	if (kvm)
> -		kvm->arch.xive = NULL;
> +	kvm->arch.xive = NULL;
>  
>  	/* Mask and free interrupts */
>  	for (i = 0; i <= xive->max_sbid; i++) {
> @@ -1870,6 +1896,8 @@ static void kvmppc_xive_release(struct kvm_device *dev)
>  	if (xive->vp_base != XIVE_INVALID_VP)
>  		xive_native_free_vp_block(xive->vp_base);
>  
> +	kvm->arch.mmu_ready = was_ready;
> +
>  	/*
>  	 * A reference of the kvmppc_xive pointer is now kept under
>  	 * the xive_devices struct of the machine for reuse. It is
> @@ -1906,6 +1934,9 @@ struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type)
>  	return xive;
>  }
>  
> +/*
> + * Create a XICS device with XIVE backend.  kvm->lock is held.
> + */
>  static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  {
>  	struct kvmppc_xive *xive;
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 0497272a..5e14df1 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c

May be also add the comment :

kvm->lock is held when kvmppc_xive_native_release() is called.


> @@ -973,21 +973,47 @@ static void kvmppc_xive_native_release(struct kvm_device *dev)
>  	struct kvm *kvm = xive->kvm;
>  	struct kvm_vcpu *vcpu;
>  	int i;
> +	int was_ready;
>  
>  	debugfs_remove(xive->dentry);
>  
>  	pr_devel("Releasing xive native device\n");
>  
>  	/*
> -	 * When releasing the KVM device fd, the vCPUs can still be
> -	 * running and we should clean up the vCPU interrupt
> -	 * presenters first.
> +	 * Clearing mmu_ready temporarily while holding kvm->lock
> +	 * is a way of ensuring that no vcpus can enter the guest
> +	 * until we drop kvm->lock.  Doing kick_all_cpus_sync()
> +	 * ensures that any vcpu executing inside the guest has
> +	 * exited the guest.  Once kick_all_cpus_sync() has finished,
> +	 * we know that no vcpu can be executing the XIVE push or
> +	 * pull code or accessing the XIVE MMIO regions.
> +	 *
> +	 * Since this is the device release function, we know that
> +	 * userspace does not have any open fd or mmap referring to
> +	 * the device.  Therefore there can not be any of the
> +	 * device attribute set/get, mmap, or page fault functions
> +	 * being executed concurrently, and similarly, the
> +	 * connect_vcpu and set/clr_mapped functions also cannot
> +	 * be being executed.
>  	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> +	was_ready = kvm->arch.mmu_ready;
> +	kvm->arch.mmu_ready = 0;
> +	kick_all_cpus_sync();
> +
> +	/*
> +	 * We should clean up the vCPU interrupt presenters first.
> +	 */
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		/*
> +		 * Take vcpu->mutex to ensure that no one_reg get/set ioctl
> +		 * (i.e. kvmppc_xive_native_[gs]et_vp) can be being done.
> +		 */
> +		mutex_lock(&vcpu->mutex);
>  		kvmppc_xive_native_cleanup_vcpu(vcpu);
> +		mutex_unlock(&vcpu->mutex);
> +	}
>  
> -	if (kvm)
> -		kvm->arch.xive = NULL;
> +	kvm->arch.xive = NULL;
>  
>  	for (i = 0; i <= xive->max_sbid; i++) {
>  		if (xive->src_blocks[i])
> @@ -999,6 +1025,8 @@ static void kvmppc_xive_native_release(struct kvm_device *dev)
>  	if (xive->vp_base != XIVE_INVALID_VP)
>  		xive_native_free_vp_block(xive->vp_base);
>  
> +	kvm->arch.mmu_ready = was_ready;
> +
>  	/*
>  	 * A reference of the kvmppc_xive pointer is now kept under
>  	 * the xive_devices struct of the machine for reuse. It is
> @@ -1009,6 +1037,9 @@ static void kvmppc_xive_native_release(struct kvm_device *dev)
>  	kfree(dev);
>  }
>  
> +/*
> + * Create a XIVE device.  kvm->lock is held.
> + */
>  static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>  {
>  	struct kvmppc_xive *xive;
> 




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux