From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, July 9, 2021 12:44 PM > > > > From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, July 2, 2021 4:59 PM > > > > PM > > > > > > > > [snip] > > > > > > > > > > > +static void az_blob_remove_device(struct az_blob_device *dev) { > > > > > > > + wait_event(dev->file_wait, list_empty(&dev->file_list)); > > > > > > > + misc_deregister(&az_blob_misc_device); > > > > > > > +#ifdef CONFIG_DEBUG_FS > > > > > > > + debugfs_remove_recursive(az_blob_debugfs_root); > > > > > > > +#endif > > > > > > > + /* At this point, we won't get any requests from user-mode > > > > > > > +*/ } > > > > > > > + > > > > > > > +static int az_blob_create_device(struct az_blob_device *dev) { > > > > > > > + int rc; > > > > > > > + struct dentry *d; > > > > > > > + > > > > > > > + rc = misc_register(&az_blob_misc_device); > > > > > > > + if (rc) { > > > > > > > + az_blob_err("misc_register failed rc %d\n", rc); > > > > > > > + return rc; > > > > > > > + } > > > > > > > + > > > > > > > +#ifdef CONFIG_DEBUG_FS > > > > > > > + az_blob_debugfs_root = debugfs_create_dir("az_blob", > > NULL); > > > > > > > + if (!IS_ERR_OR_NULL(az_blob_debugfs_root)) { > > > > > > > + d = debugfs_create_file("pending_requests", 0400, > > > > > > > + az_blob_debugfs_root, NULL, > > > > > > > + &az_blob_debugfs_fops); > > > > > > > + if (IS_ERR_OR_NULL(d)) { > > > > > > > + az_blob_warn("failed to create debugfs > > file\n"); > > > > > > > + > > debugfs_remove_recursive(az_blob_debugfs_root); > > > > > > > + az_blob_debugfs_root = NULL; > > > > > > > + } > > > > > > > + } else > > > > > > > + az_blob_warn("failed to create debugfs root\n"); > > #endif > > > > > > > + > > > > > > > + return 0; > > > > > > > +} > > > > > > > + > > > > > > > +static int az_blob_connect_to_vsp(struct hv_device *device, > > > > > > > +u32 > > > > > > > +ring_size) { > > > > > > > + int ret; > > > > > > > + > > > > > > > + spin_lock_init(&az_blob_dev.file_lock); > > > > > > > > > > > > I'd argue that the spin lock should not be re-initialized here. > > > > > > Here's the sequence where things go wrong: > > > > > > > > > > > > 1) The driver is unbound, so az_blob_remove() is called. > > > > > > 2) az_blob_remove() sets the "removing" flag to true, and calls > > > > > > az_blob_remove_device(). > > > > > > 3) az_blob_remove_device() waits for the file_list to become empty. > > > > > > 4) After the file_list becomes empty, but before > > > > > > misc_deregister() is called, a separate thread opens the device again. > > > > > > 5) In the separate thread, az_blob_fop_open() obtains the > > > > > > file_lock spin > > > > lock. > > > > > > 6) Before az_blob_fop_open() releases the spin lock, > > > > > > az_blob_remove_device() completes, along with az_blob_remove(). > > > > > > 7) Then the device gets rebound, and az_blob_connect_to_vsp() > > > > > > gets called, all while az_blob_fop_open() still holds the spin > > > > > > lock. So the spin lock get re- initialized while it is held. > > > > > > > > > > > > This is admittedly a far-fetched scenario, but stranger things > > > > > > have happened. :-) The issue is that you are counting on the > > > > > > az_blob_dev structure to persist and have a valid file_lock, > > > > > > even while the device is unbound. So any initialization should > > > > > > only happen in > > > > az_blob_drv_init(). > > > > > > > > > > I'm not sure if az_blob_probe() and az_blob_remove() can be called > > > > > at the same time, as az_blob_remove_vmbus() is called the last in > > > > az_blob_remove(). > > > > > Is it possible for vmbus asking the driver to probe a new channel > > > > > before the old channel is closed? I expect the vmbus provide > > > > > guarantee that those calls are made in sequence. > > > > > > > > In my scenario above, az_blob_remove_vmbus() and az_blob_remove() > > > > run to completion in Step #6, all while some other thread is still > > > > in the middle of an > > > > open() call and holding the file_lock spin lock. Then in Step #7 > > > > az_blob_probe() runs. So az_blob_remove() and az_blob_probe() > > > > execute sequentially, not at the same time. > > > > > > > > Michael > > > > > > I think it's a valid scenario. The return value of > > > devtmpfs_delete_node() is not checked in misc_deregister(). It decreases > > the refcount on inodes but it's not guaranteed that someone else is still using > > it (in the middle of opening a file). > > > > > > However, this works fine for "rmmod" that causes device to be removed. > > > Before file is opened the refcount on the module is increased so it can't be > > removed when file is being opened. The scenario you described can't > > happen. > > > > > > But during VMBUS rescind, it can happen. It's possible that the driver > > > is using the spinlock that has been re-initialized, when the next VMBUS > > offer on the same channel comes before all the attempting open file calls > > exit. > > > > I was thinking about the rescind scenario. vmbus_onoffer_rescind() will run > > on the global workqueue. If it eventually calls az_blob_remove() and then > > az_blob_remove_device(), it will wait until the file_list is empty, which > > essentially means waiting until user space processes decide to close the > > instances they have open. This seems like a problem that could block the > > global workqueue for a long time and > > thereby hang the kernel. Is my reasoning valid? If so, I haven't > > thought about what the solution might be. It seems like we do need to wait > > until any in-progress requests to Hyper-V are complete because Hyper-V has > > references to guest physical memory. But waiting for all open instances to > > be closed seems to be problematic. > > My tests showed that misc_deregister() caused all opened files to be released if there are no pending I/O waiting in the > driver. > > If there are pending I/O, we must wait as the VSP owns the memory of the I/O. The correct VSP behavior is to return all the > pending I/O along with rescind. This is the same to what storvsc does for rescind. Yes, we have to wait since the VSP owns the memory. But you make a good point that the VSP behavior should be to return all the pending I/Os at roughly the same time as the rescind. > > It looks to me waiting for opened files after the call to misc_deregister(), but before removing the vmbus channel is a safe > approach. To me, this would be a great solution. And it closes the original timing window I pointed out where some other thread could open the device after having waited for the file_list to be empty, but before misc_deregister() is called. With the order swapped, once the file_list is empty, there's no way for the device to be opened again. So it solves the problem with the spin_lock initialization. > > If the VSP is behaving correctly, the rescind process should not block for too long. If we want to deal with a buggy VSP that > takes forever to release a resource, we want to create a work queue for rescind handling. I don't think we need to deal with a buggy VSP holding memory for an excessively long time. Michael > > > > > > This is a very rare. I agree things happen that we should make sure the > > driver can handle this. I'll update the driver. > > > > > > Long > > > > > > > > > > > > > > > > > > > > > > > > > + INIT_LIST_HEAD(&az_blob_dev.file_list); > > > > > > > + init_waitqueue_head(&az_blob_dev.file_wait); > > > > > > > + > > > > > > > + az_blob_dev.removing = false; > > > > > > > + > > > > > > > + az_blob_dev.device = device; > > > > > > > + device->channel->rqstor_size = device_queue_depth; > > > > > > > + > > > > > > > + ret = vmbus_open(device->channel, ring_size, ring_size, > > NULL, 0, > > > > > > > + az_blob_on_channel_callback, device- > > >channel); > > > > > > > + > > > > > > > + if (ret) { > > > > > > > + az_blob_err("failed to connect to VSP ret %d\n", ret); > > > > > > > + return ret; > > > > > > > + } > > > > > > > + > > > > > > > + hv_set_drvdata(device, &az_blob_dev); > > > > > > > + > > > > > > > + return ret; > > > > > > > +} > > > > > > > + > > > > > > > +static void az_blob_remove_vmbus(struct hv_device *device) { > > > > > > > + /* At this point, no VSC/VSP traffic is possible over vmbus */ > > > > > > > + hv_set_drvdata(device, NULL); > > > > > > > + vmbus_close(device->channel); } > > > > > > > + > > > > > > > +static int az_blob_probe(struct hv_device *device, > > > > > > > + const struct hv_vmbus_device_id *dev_id) { > > > > > > > + int rc; > > > > > > > + > > > > > > > + az_blob_dbg("probing device\n"); > > > > > > > + > > > > > > > + rc = az_blob_connect_to_vsp(device, > > az_blob_ringbuffer_size); > > > > > > > + if (rc) { > > > > > > > + az_blob_err("error connecting to VSP rc %d\n", rc); > > > > > > > + return rc; > > > > > > > + } > > > > > > > + > > > > > > > + // create user-mode client library facing device > > > > > > > + rc = az_blob_create_device(&az_blob_dev); > > > > > > > + if (rc) { > > > > > > > + az_blob_remove_vmbus(device); > > > > > > > + return rc; > > > > > > > + } > > > > > > > + > > > > > > > + az_blob_dbg("successfully probed device\n"); > > > > > > > + return 0; > > > > > > > +} > > > > > > > + > > > > > > > +static int az_blob_remove(struct hv_device *dev) { > > > > > > > + struct az_blob_device *device = hv_get_drvdata(dev); > > > > > > > + unsigned long flags; > > > > > > > + > > > > > > > + spin_lock_irqsave(&device->file_lock, flags); > > > > > > > + device->removing = true; > > > > > > > + spin_unlock_irqrestore(&device->file_lock, flags); > > > > > > > + > > > > > > > + az_blob_remove_device(device); > > > > > > > + az_blob_remove_vmbus(dev); > > > > > > > + return 0; > > > > > > > +} > > > > > > > + > > > > > > > +static struct hv_driver az_blob_drv = { > > > > > > > + .name = KBUILD_MODNAME, > > > > > > > + .id_table = id_table, > > > > > > > + .probe = az_blob_probe, > > > > > > > + .remove = az_blob_remove, > > > > > > > + .driver = { > > > > > > > + .probe_type = PROBE_PREFER_ASYNCHRONOUS, > > > > > > > + }, > > > > > > > +}; > > > > > > > + > > > > > > > +static int __init az_blob_drv_init(void) { > > > > > > > + int ret; > > > > > > > + > > > > > > > + ret = vmbus_driver_register(&az_blob_drv); > > > > > > > + return ret; > > > > > > > +} > > > > > > > + > > > > > > > +static void __exit az_blob_drv_exit(void) { > > > > > > > + vmbus_driver_unregister(&az_blob_drv); > > > > > > > +}