On 2024/7/1 16:43, Tian, Kevin wrote:
From: Zhao, Yan Y <yan.y.zhao@xxxxxxxxx>
Sent: Friday, June 28, 2024 11:19 PM
In the device cdev path, adjust the handling of the KVM reference count to
only increment with the first vfio_df_open() and decrement after the final
vfio_df_close(). This change addresses a KVM reference leak that occurs
when a device cdev file is opened multiple times and attempts to bind to
iommufd repeatedly.
Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
in the cdev path during iommufd binding. The corresponding
vfio_device_put_kvm() is executed either when iommufd is unbound or if an
error occurs during the binding process.
However, issues arise when a device binds to iommufd more than once. The
second vfio_df_open() will fail during iommufd binding, and
vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
Consequently, when iommufd is unbound from the first successfully bound
device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
KVM reference count.
To be accurate this happens only when two binds are issued via different
fds otherwise below check will happen earlier when two binds are in a
same fd:
/* one device cannot be bound twice */
if (df->access_granted) {
ret = -EINVAL;
goto out_unlock;
}
yes
Below is the calltrace that will be produced in this scenario when the KVM
module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".
Call Trace:
<TASK>
dump_stack_lvl+0x80/0xc0
slab_err+0xb0/0xf0
? __kmem_cache_shutdown+0xc1/0x4e0
? rcu_is_watching+0x11/0x50
? lock_acquired+0x144/0x3c0
__kmem_cache_shutdown+0x1b7/0x4e0
kmem_cache_destroy+0xa6/0x260
kvm_exit+0x80/0xc0 [kvm]
vmx_exit+0xe/0x20 [kvm_intel]
__x64_sys_delete_module+0x143/0x250
? ktime_get_coarse_real_ts64+0xd3/0xe0
? syscall_trace_enter+0x143/0x210
do_syscall_64+0x6f/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
---
drivers/vfio/device_cdev.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index bb1817bd4ff3..3b85d01d1b27 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct
vfio_device_file *df,
{
struct vfio_device *device = df->device;
struct vfio_device_bind_iommufd bind;
+ bool put_kvm = false;
unsigned long minsz;
int ret;
@@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct
vfio_device_file *df,
}
/*
- * Before the device open, get the KVM pointer currently
+ * Before the device's first open, get the KVM pointer currently
* associated with the device file (if there is) and obtain
- * a reference. This reference is held until device closed.
+ * a reference. This reference is held until device's last closed.
* Save the pointer in the device for use by drivers.
*/
- vfio_df_get_kvm_safe(df);
+ if (device->open_count == 0) {
+ vfio_df_get_kvm_safe(df);
+ put_kvm = true;
+ }
ret = vfio_df_open(df);
if (ret)
@@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct
vfio_device_file *df,
out_close_device:
vfio_df_close(df);
out_put_kvm:
- vfio_device_put_kvm(device);
+ if (put_kvm)
+ vfio_device_put_kvm(device);
iommufd_ctx_put(df->iommufd);
df->iommufd = NULL;
out_unlock:
what about extending vfio_df_open() to unify the get/put_kvm()
and open_count trick in one place?
int vfio_df_open(struct vfio_device_file *df, struct kvm *kvm,
spinlock_t *kvm_ref_lock)
{
this should work. But need a comment to note why need pass in both kvm
and kvm_ref_lock given df has both of them. :)
struct vfio_device *device = df->device;
int ret = 0;
lockdep_assert_held(&device->dev_set->lock);
if (device->open_count == 0) {
spin_lock(kvm_ref_lock);
vfio_device_get_kvm_safe(device, kvm);
spin_unlock(kvm_ref_lock);
}
perhaps it can be put in the "if (device->open_count == 1)" branch, just
before invoking vfio_df_device_first_open().
/*
* Only the group path allows the device to be opened multiple
* times. The device cdev path doesn't have a secure way for it.
*/
if (device->open_count != 0 && !df->group)
return -EINVAL;
device->open_count++;
if (device->open_count == 1) {
ret = vfio_df_device_first_open(df);
if (ret)
device->open_count--;
}
if (ret)
vfio_device_put_kvm(device);
return ret;
}
void vfio_df_close(struct vfio_device_file *df)
{
struct vfio_device *device = df->device;
lockdep_assert_held(&device->dev_set->lock);
vfio_assert_device_open(device);
if (device->open_count == 1) {
vfio_df_device_last_close(df);
vfio_device_put_kvm(device);
}
device->open_count--;
}
--
Regards,
Yi Liu