On 12/12/2017 01:16 AM, Bart Van Assche wrote: > On Sun, 2017-12-10 at 15:44 +0100, Hannes Reinecke wrote: >> You know what, I have been working on a similar patch for quite some >> time now; [ ... ] > > That's very interesting :-) > >> However, in doing so I have encountered several issues which have been >> exposed by that; the most trivial one being that del_gendisk() doesn't >> check if GENHD_FL_UP is set, so it'll crash if sd_remove is called >> before sd_async_probe() is run. > > Have you noticed that my patch adds a call to sd_wait_for_probing() before > the call to del_gendisk()? That should be sufficient to prevent the crash you > described. > Not necessarily; see below. >> The other one is an imbalance between sd_probe and sd_remove; >> when sd_probe_async() is called _after_ scsi_device_remove_device() >> (which it will as the synchronization is only after device_del() has >> been called) it will trip over non-existent sysfs directories in add_disk(). >> So one need to short-circuit sd_probe_async() for devices in SDEV_CANCEL >> or SDEV_DEL. > > Same comment here - I think the sd_wait_for_probing() call I inserted in > sd_remove() should prevent this scenario. > No, it doesn't. sd_remove is called via __scsi_device_remove_device() -> device_del() -> sd_remove() with this piece of code: bsg_unregister_queue(sdev->request_queue); device_unregister(&sdev->sdev_dev); transport_remove_device(dev); device_del(dev); ie by the time device_del() we've already called device_unregister(), which unfortunately is the sysfs parent kobj for the gendisk. So when you're calling sd_wait_for_probing() in sd_remove() it'll call sd_async_probe(), and add_disk() will be called with a NULL parent object. >> However, I'm still waiting for a final confirmation on the latter issue, >> hence I haven't posted the patchset. >> If there's interest I can post them, though. > > The big question here is how you would like to proceed - should we start from > your patch series or start from my patch? Since I have not encountered any of > the issues you described with my patch, does that mean that the approach I > followed results in more reliable operation? > Not necessarily. The issues I've described got reported by a big CDN provider after several days worth of hammering. It's not something I could reproduce here in my puny lab. So just because you haven't seen the race doesn't mean there is none :-) I'll be posting the patches shortly. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@xxxxxxxx +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)