> Subject: RE: [Patch v3 2/3] Drivers: hv: add Azure Blob driver > > From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, July 16, 2021 12:27 PM > > [snip] > > > > > + > > > > +static int az_blob_connect_to_vsp(struct hv_device *device, u32 > > > > +ring_size) { > > > > + int ret; > > > > + > > > > + spin_lock_init(&az_blob_dev.file_lock); > > > > + INIT_LIST_HEAD(&az_blob_dev.file_list); > > > > + init_waitqueue_head(&az_blob_dev.file_wait); > > > > + > > > > + az_blob_dev.device = device; > > > > + device->channel->rqstor_size = AZ_BLOB_QUEUE_DEPTH; > > > > + > > > > + ret = vmbus_open(device->channel, ring_size, ring_size, NULL, 0, > > > > + az_blob_on_channel_callback, device->channel); > > > > + > > > > + if (ret) > > > > + return ret; > > > > + > > > > + hv_set_drvdata(device, &az_blob_dev); > > > > + > > > > + return ret; > > > > +} > > > > + > > > > +static void az_blob_remove_vmbus(struct hv_device *device) { > > > > + /* At this point, no VSC/VSP traffic is possible over vmbus */ > > > > + hv_set_drvdata(device, NULL); > > > > + vmbus_close(device->channel); > > > > +} > > > > + > > > > +static int az_blob_probe(struct hv_device *device, > > > > + const struct hv_vmbus_device_id *dev_id) { > > > > + int ret; > > > > + > > > > + ret = az_blob_connect_to_vsp(device, AZ_BLOB_RING_SIZE); > > > > + if (ret) { > > > > + az_blob_err("error connecting to VSP ret=%d\n", ret); > > > > + return ret; > > > > + } > > > > + > > > > + // create user-mode client library facing device > > > > + ret = az_blob_create_device(&az_blob_dev); > > > > + if (ret) { > > > > + az_blob_err("failed to create device ret=%d\n", ret); > > > > + az_blob_remove_vmbus(device); > > > > + return ret; > > > > + } > > > > + > > > > + az_blob_dev.removing = false; > > > > > > This line seems misplaced. As soon as az_blob_create_device() > > > returns, some other thread could find the device and open it. You > > > don't want the > > > az_blob_fop_open() function to find the "removing" > > > flag set to true. So I think this line should go *before* the call > > > to az_blob_create_device(). > > > > > > > + az_blob_info("successfully probed device\n"); > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +static int az_blob_remove(struct hv_device *dev) { > > > > + az_blob_dev.removing = true; > > > > + /* > > > > + * RCU lock guarantees that any future calls to az_blob_fop_open() > > > > + * can not use device resources while the inode reference of > > > > + * /dev/azure_blob is being held for that open, and device file is > > > > + * being removed from /dev. > > > > + */ > > > > + synchronize_rcu(); > > > > > > I don't think this works as you have described. While it will > > > ensure that any in-progress RCU read-side critical sections have > > > completed (i.e., in > > > az_blob_fop_open() ), it does not prevent new read-side critical > > > sections from being started. So as soon as synchronize_rcu() is > > > finished, another thread could find and open the device, and be > > > executing in az_blob_fop_open(). > > > > > > But it's not clear to me that this (and the rcu_read_locks in > > > az_blob_fop_open) are really needed anyway. If > > > az_blob_remove_device() is called while one or more threads have it > > > open, I think that's OK. Or is there a scenario that I'm missing? > > > > I was trying to address your comment earlier. Here were your comments (1 > - 7): > > > > 1) The driver is unbound, so az_blob_remove() is called. > > 2) az_blob_remove() sets the "removing" flag to true, and calls > az_blob_remove_device(). > > 3) az_blob_remove_device() waits for the file_list to become empty. > > 4) After the file_list becomes empty, but before misc_deregister() is called, > a separate thread opens the device again. > > 5) In the separate thread, az_blob_fop_open() obtains the file_lock spin > lock. > > 6) Before az_blob_fop_open() releases the spin lock, az > > block_remove_device() completes, along with az_blob_remove(). > > 7) Then the device gets rebound, and az_blob_connect_to_vsp() gets > > called, all while az_blob_fop_open() still holds the spin lock. So the spin > lock get re-initialized while it is held. > > > > Between step 4 and step 5, I don't see any guarantee that > > az_blob_fop_open() can't run concurrently on another CPU after > > misc_deregister() finishes. misc_deregister() calls > > devtmpfs_delete_node() to remove the device file from /dev/*, but it > doesn't check the return value, so the inode reference number can be non- > zero after it returns, somebody may still try to open it. > > I'm no expert here, but once misc_deregister() finishes, I would expect that > any attempt to open the device with the assigned major:minor number will > fail. However, existing opens may still be active. So another thread could > still > have it open, but no new opens can be done. As coded in this version of > your patch, az_blob_remove() waits until all those existing opens have been > closed, > so everything is clean once the wait is complete. Of course, while waiting > here, the threads with the open fd could try to start another operation > against the VSP, but the "removing" flag will prevent that. So the only access > these threads will have is to the singleton az_blob_dev instance and the list > of open files. > > But as noted previously, waiting here for the opens to be closed is > problematic because the wait could be arbitrarily long. And there's > messiness if the device gets re-added later, because there's no distinction > between the old instantiation that a thread might still have open vs. the new > instantiation that you want to initialize fresh. That's where creating a new > dynamic instance will solve any problems. As I said in my previous email, I'm moving to dynamically allocated objects to fix all of these. > > > This check guarantees that the code can't reference any driver's internal > data structures. > > az_blob_dev.removing is set so this code can't be entered. Resetting > > it after az_blob_create_device() is also for this purpose. > > > > > > > > > + > > > > + az_blob_info("removing device\n"); > > > > + az_blob_remove_device(); > > > > + > > > > + /* > > > > + * At this point, it's not possible to open more files. > > > > + * Wait for all the opened files to be released. > > > > + */ > > > > + wait_event(az_blob_dev.file_wait, > > > > +list_empty(&az_blob_dev.file_list)); > > > > > > As mentioned in my most recent comments on the previous version of > > > the code, I'm concerned about waiting for all open files to be > > > released in the case of a VMbus rescind. We definitely have to wait > > > for all VSP operations to complete, but that's different from > > > waiting for the files to be closed. The former depends on Hyper-V > > > being well-behaved and will presumably happen quickly in the case of > > > a rescind. But the latter depends entirely on user space code that > > > is out of the kernel's control. The user space process could hang > > > around for hours or days with the file still open, which would block this > function from completing, and hence block the global work queue. > > > > > > Thinking about this, the core issue may be that having a single > > > static instance of az_blob_device is problematic if we allow the > > > device to be removed (either explicitly with an unbind, or > > > implicitly with a VMbus > > > rescind) and then re-added. Perhaps what needs to happen is that > > > the removed device is allowed to continue to exist as long as user > > > space processes have an open file handle, but they can't perform any > > > operations because the "removing" flag is set (and stays set). > > > If the device is re-added, then a new instance of az_blob_device > > > should be created, and whether or not the old instance is still > > > hanging around is irrelevant. > > > > I agree dynamic device objects is the way to go. Will implement this. > > > > > > > > > + > > > > + az_blob_remove_vmbus(dev); > > > > + return 0; > > > > +} > > > > + > > > > +static struct hv_driver az_blob_drv = { > > > > + .name = KBUILD_MODNAME, > > > > + .id_table = id_table, > > > > + .probe = az_blob_probe, > > > > + .remove = az_blob_remove, > > > > + .driver = { > > > > + .probe_type = PROBE_PREFER_ASYNCHRONOUS, > > > > + }, > > > > +}; > > > > + > > > > +static int __init az_blob_drv_init(void) { > > > > + return vmbus_driver_register(&az_blob_drv); > > > > +} > > > > + > > > > +static void __exit az_blob_drv_exit(void) { > > > > + vmbus_driver_unregister(&az_blob_drv); > > > > +} > > > > + > > > > +MODULE_LICENSE("Dual BSD/GPL"); > > > > +MODULE_DESCRIPTION("Microsoft Azure Blob driver"); > > > > +module_init(az_blob_drv_init); module_exit(az_blob_drv_exit); > > > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > > > > index 705e95d..3095611 100644 > > > > --- a/drivers/hv/channel_mgmt.c > > > > +++ b/drivers/hv/channel_mgmt.c > > > > @@ -154,6 +154,13 @@ > > > > .allowed_in_isolated = false, > > > > }, > > > > > > > > + /* Azure Blob */ > > > > + { .dev_type = HV_AZURE_BLOB, > > > > + HV_AZURE_BLOB_GUID, > > > > + .perf_device = false, > > > > + .allowed_in_isolated = false, > > > > + }, > > > > + > > > > /* Unknown GUID */ > > > > { .dev_type = HV_UNKNOWN, > > > > .perf_device = false, > > > > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index > > > > d1e59db..ac31362 100644 > > > > --- a/include/linux/hyperv.h > > > > +++ b/include/linux/hyperv.h > > > > @@ -772,6 +772,7 @@ enum vmbus_device_type { > > > > HV_FCOPY, > > > > HV_BACKUP, > > > > HV_DM, > > > > + HV_AZURE_BLOB, > > > > HV_UNKNOWN, > > > > }; > > > > > > > > @@ -1350,6 +1351,14 @@ int vmbus_allocate_mmio(struct resource > > > **new, struct hv_device *device_obj, > > > > 0x72, 0xe2, 0xff, 0xb1, 0xdc, 0x7f) > > > > > > > > /* > > > > + * Azure Blob GUID > > > > + * {0590b792-db64-45cc-81db-b8d70c577c9e} > > > > + */ > > > > +#define HV_AZURE_BLOB_GUID \ > > > > + .guid = GUID_INIT(0x0590b792, 0xdb64, 0x45cc, 0x81, 0xdb, \ > > > > + 0xb8, 0xd7, 0x0c, 0x57, 0x7c, 0x9e) > > > > + > > > > +/* > > > > * Shutdown GUID > > > > * {0e0b6031-5213-4934-818b-38d90ced39db} > > > > */ > > > > diff --git a/include/uapi/misc/azure_blob.h > > > > b/include/uapi/misc/azure_blob.h new file mode 100644 index > > > > 0000000..f4168679 > > > > --- /dev/null > > > > +++ b/include/uapi/misc/azure_blob.h > > > > @@ -0,0 +1,34 @@ > > > > +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0-only WITH > > > > +Linux-syscall-note */ > > > > +/* Copyright (c) Microsoft Corporation. */ > > > > + > > > > +#ifndef _AZ_BLOB_H > > > > +#define _AZ_BLOB_H > > > > + > > > > +#include <linux/kernel.h> > > > > +#include <linux/uuid.h> > > > > + > > > > +/* user-mode sync request sent through ioctl */ struct > > > > +az_blob_request_sync_response { > > > > + __u32 status; > > > > + __u32 response_len; > > > > +}; > > > > + > > > > +struct az_blob_request_sync { > > > > + guid_t guid; > > > > + __u32 timeout; > > > > + __u32 request_len; > > > > + __u32 response_len; > > > > + __u32 data_len; > > > > + __u32 data_valid; > > > > + __aligned_u64 request_buffer; > > > > > > Is there an implied 32-bit padding field before "request_buffer"? > > > It seems like "yes", since there are five 32-bit field preceding. > > > Would it be better to explicitly list that padding field? > > > > > > > + __aligned_u64 response_buffer; > > > > + __aligned_u64 data_buffer; > > > > + struct az_blob_request_sync_response response; }; > > > > + > > > > +#define AZ_BLOB_MAGIC_NUMBER 'R' > > > > +#define IOCTL_AZ_BLOB_DRIVER_USER_REQUEST \ > > > > + _IOWR(AZ_BLOB_MAGIC_NUMBER, 0xf0, \ > > > > + struct az_blob_request_sync) > > > > + > > > > +#endif /* define _AZ_BLOB_H */ > > > > -- > > > > 1.8.3.1