On 09/09/2013 11:15 AM, Josh Durgin wrote: > On 09/09/2013 06:37 AM, Alex Elder wrote: >> On 09/09/2013 02:17 AM, Josh Durgin wrote: >>> Removing a device deallocates the disk, unschedules the watch, and >>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called >>> from the watch callback, updates the disk size and rbd_dev >>> structure. With no locking between them, rbd_dev_refresh() may use the >>> device or rbd_dev after they've been freed. . . . >>> + if (!removing) { >> >> Here's the problem. It is the same one faced by the open path. >> >> You determined above whether or not the device was getting removed. >> But you haven't left any state that indicates that the image is >> getting refreshed (or more specifically, getting its size updated). >> >> So, if the device is mapped but isn't actually opened by anybody >> there's nothing stopping the code in rbd_remove() from going >> ahead anyway. So the possibility of the structures getting freed >> remains (though the window is a little smaller now). > > I think this is safe with the use of ceph_osdc_flush_notifies() > before rbd_bus_del_dev(). Since the watch is already unregistered, > flushing the notifies waits for all calls to rbd_watch_cb() to > complete. This means that rbd_remove() can't actually free > any of the rbd_dev structures until after the last revalidate_disk() > etc. is done. Does this make sense, or am I missing something? Yes, now I remember. You're right. That was the point of doing the flush_notifies call before release anyway... So my "nothing stopping the code in rbd_remove" was wrong. In fact, that ceph_osdc_flush_notifies() call is stopping it from freeing things that may still be in use. Thanks for explaining it to me (again). > This should probably go in a comment in rbd_remove(), since > it's not obvious from the code there why the ordering of > the last few calls makes sense. Yes. I think so. Reviewed-by: Alex Elder <elder@xxxxxxxxxx> > Josh > >> I think you can resolve this problem by using the open count. >> That is, use the same interlock between the open count and the >> removing flag that's used for rbd_open() and rbd_remove(). >> The open count in some sense represents someone still needing >> the data structures for the image. So treat the refresh activity >> as one of those "somebodies" by claiming one of those opens >> while the size update is going on. >> >> If you encapsulated some code now in place for this purpose, >> something like the following might be helpful. (I've renamed >> "open_count" to be "inuse_count" here.) >> >> static inline int rbd_request_access(struct rbd_device *rbd_dev) >> { >> int ret = 0; >> >> spin_lock_irq(&rbd_dev->lock); >> if (!test_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags)) >> rbd_dev->inuse_count++; >> else >> ret = -ENOENT; >> spin_unlock_irq(&rbd_dev->lock); >> >> return ret; >> } >> >> static inline int rbd_release_access(struct rbd_device *rbd_dev) >> { >> int ret = 0; >> >> spin_lock_irq(&rbd_dev->lock); >> if (rbd_dev->inuse_count) >> rbd_dev->inuse_count--; >> else >> ret = -EINVAL; >> spin_unlock_irq(&rbd_dev->lock); >> >> return ret; >> } >> >> static inline int rbd_prevent_access(struct rbd_device *rbd_dev) >> { >> int ret = 0; >> >> spin_lock_irq(&rbd_dev->lock); >> if (rbd_dev->inuse_count) >> ret = -EBUSY; >> else if (test_and_set_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags)) >> ret = -EINPROGRESS; >> spin_unlock_irq(&rbd_dev->lock); >> >> return ret; >> } >> >> I'll assume you know what I'm talking about at this point and >> won't go into exactly where and how you'd use these. >> >>> + size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE; >>> + dout("setting size to %llu sectors", (unsigned long long)size); >>> + set_capacity(rbd_dev->disk, size); >>> + revalidate_disk(rbd_dev->disk); >>> + } >>> +}+- >>> + >>> static int rbd_dev_refresh(struct rbd_device *rbd_dev) >>> { >>> u64 mapping_size; >> >> Everything below is good. >> >>> @@ -3344,12 +3369,7 @@ static int rbd_dev_refresh(struct rbd_device >>> *rbd_dev) >>> up_write(&rbd_dev->header_rwsem); >>> >>> if (mapping_size != rbd_dev->mapping.size) { >>> - sector_t size; >>> - >>> - size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE; >>> - dout("setting size to %llu sectors", (unsigned long long)size); >>> - set_capacity(rbd_dev->disk, size); >>> - revalidate_disk(rbd_dev->disk); >>> + rbd_dev_update_size(rbd_dev); >>> } >>> >>> return ret; >>> @@ -5160,7 +5180,6 @@ static ssize_t rbd_remove(struct bus_type *bus, >>> if (ret < 0 || already) >>> return ret; >>> >>> - rbd_bus_del_dev(rbd_dev); >>> ret = rbd_dev_header_watch_sync(rbd_dev, false); >>> if (ret) >>> rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret); >>> @@ -5171,6 +5190,7 @@ static ssize_t rbd_remove(struct bus_type *bus, >>> */ >>> dout("%s: flushing notifies", __func__); >>> ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc); >>> + rbd_bus_del_dev(rbd_dev); >>> rbd_dev_image_release(rbd_dev); >>> module_put(THIS_MODULE); >>> >>> >> > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html