On Mon, Jul 01, 2024 at 01:08:30PM +0800, Yi Liu wrote: > On 2024/6/28 23:18, Yan Zhao wrote: > > In the device cdev path, adjust the handling of the KVM reference count to > > only increment with the first vfio_df_open() and decrement after the final > > vfio_df_close(). This change addresses a KVM reference leak that occurs > > when a device cdev file is opened multiple times and attempts to bind to > > iommufd repeatedly. > > > > Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open() > > in the cdev path during iommufd binding. The corresponding > > vfio_device_put_kvm() is executed either when iommufd is unbound or if an > > error occurs during the binding process. > > > > However, issues arise when a device binds to iommufd more than once. The > > second vfio_df_open() will fail during iommufd binding, and > > vfio_device_put_kvm() will be triggered, setting device->kvm to NULL. > > Consequently, when iommufd is unbound from the first successfully bound > > device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the > > KVM reference count. > > Good catch!!! > > > Below is the calltrace that will be produced in this scenario when the KVM > > module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S): > > Objects remaining in kvm_vcpu on __kmem_cache_shutdown()". > > > > Call Trace: > > <TASK> > > dump_stack_lvl+0x80/0xc0 > > slab_err+0xb0/0xf0 > > ? __kmem_cache_shutdown+0xc1/0x4e0 > > ? rcu_is_watching+0x11/0x50 > > ? lock_acquired+0x144/0x3c0 > > __kmem_cache_shutdown+0x1b7/0x4e0 > > kmem_cache_destroy+0xa6/0x260 > > kvm_exit+0x80/0xc0 [kvm] > > vmx_exit+0xe/0x20 [kvm_intel] > > __x64_sys_delete_module+0x143/0x250 > > ? ktime_get_coarse_real_ts64+0xd3/0xe0 > > ? syscall_trace_enter+0x143/0x210 > > do_syscall_64+0x6f/0x140 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > > > Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD") > > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> > > --- > > drivers/vfio/device_cdev.c | 13 +++++++++---- > > 1 file changed, 9 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c > > index bb1817bd4ff3..3b85d01d1b27 100644 > > --- a/drivers/vfio/device_cdev.c > > +++ b/drivers/vfio/device_cdev.c > > @@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df, > > { > > struct vfio_device *device = df->device; > > struct vfio_device_bind_iommufd bind; > > + bool put_kvm = false; > > unsigned long minsz; > > int ret; > > @@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df, > > } > > /* > > - * Before the device open, get the KVM pointer currently > > + * Before the device's first open, get the KVM pointer currently > > * associated with the device file (if there is) and obtain > > - * a reference. This reference is held until device closed. > > + * a reference. This reference is held until device's last closed. > > * Save the pointer in the device for use by drivers. > > */ > > - vfio_df_get_kvm_safe(df); > > + if (device->open_count == 0) { > > + vfio_df_get_kvm_safe(df); > > + put_kvm = true; > > + } > > ret = vfio_df_open(df); > > if (ret) > > @@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df, > > out_close_device: > > vfio_df_close(df); > > out_put_kvm: > > - vfio_device_put_kvm(device); > > + if (put_kvm) > > you may use if (device->open_count == 0) as well here to save a bool. > Otherwise looks good to me. Upon here, device->open_count is not necessarily 0 even for the first open. The failure can be after a successful increment of device->open_count. Maybe renaming "bool put_kvm" to "bool is_first_open" to save an assignment in most common case? > Reviewed-by: Yi Liu <yi.l.liu@xxxxxxxxx> Thanks:) > > > + vfio_device_put_kvm(device); > > iommufd_ctx_put(df->iommufd); > > df->iommufd = NULL; > > out_unlock: > > > > base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f > > BTW. The vfio_device_get_kvm_safe() is not supposed to be invoked multiple > times by design as the second call would override the device->put_kvm and > device->kvm. This does not change the put_kvm nor the kvm though. But not a "kvm" may also be changed if the second bind is from a different VM. > good thing anyghow. maybe worth a warn like below. > > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c > index ee72c0b61795..a4bead0e5820 100644 > --- a/drivers/vfio/vfio_main.c > +++ b/drivers/vfio/vfio_main.c > @@ -408,6 +408,8 @@ void vfio_device_get_kvm_safe(struct vfio_device > *device, struct kvm *kvm) > if (!kvm) > return; > > + WARN_ON(device->put_kvm || device->kvm); Yes, better. > pfn = symbol_get(kvm_put_kvm); > if (WARN_ON(!pfn)) > return; > > -- > Regards, > Yi Liu