On Sun, Jan 8, 2017 at 11:46 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Fri 06-01-17 09:45:45, Dan Williams wrote:
>> On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara <jack@xxxxxxx> wrote:
>> > On Thu 05-01-17 17:17:55, Dan Williams wrote:
>> >> The ->bd_queue member of struct block_device was added in commit
>> >> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
>> >> v3.3. However, blk_get_backing_dev_info() has been using
>> >> bdev_get_queue() and grabbing the request_queue through the gendisk
>> >> since before the git era.
>> >>
>> >> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
>> >> not. The queue remains valid until the final put of the parent disk.
>> >>
>> >> The following crash signature results from blk_get_backing_dev_info()
>> >> trying to look up the queue through ->bd_disk after the final put of
>> >> the block device. Simply switch bdev_get_queue() to use ->bd_queue
>> >> directly, which is guaranteed to still be valid at
>> >> invalidate_partition() time.
>> >>
>> >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
>> >> IP: blk_get_backing_dev_info+0x10/0x20
>> >> [..]
>> >> Call Trace:
>> >>  __inode_attach_wb+0x3a7/0x5d0
>> >>  __filemap_fdatawrite_range+0xf8/0x100
>> >>  filemap_write_and_wait+0x40/0x90
>> >>  fsync_bdev+0x54/0x60
>> >>  ? bdget_disk+0x30/0x40
>> >>  invalidate_partition+0x24/0x50
>> >>  del_gendisk+0xfa/0x230
>> >
>> > So we have similar reports of the same problem. E.g.:
>> >
>> > http://www.spinics.net/lists/linux-fsdevel/msg105153.html
>> >
>> > However, I kind of miss how your patch would fix all those cases. The
>> > principal problem is that inode_to_bdi(), called on a block device
>> > inode, wants to get the backing_dev_info; however, on the last close
>> > of a block device we do put_disk() and thus the request queue
>> > containing the backing_dev_info does not have to be around at that
>> > time. In your case you are lucky enough to have the containing disk
>> > still around, but that's not the case for all inode_to_bdi() users
>> > (see e.g. the report I referenced), and your patch would change a
>> > relatively straightforward NULL pointer dereference into a rather
>> > subtle use-after-free issue,
>>
>> True. If there are other cases that don't hold their own queue
>> reference this patch makes things worse.
>>
>> > so I disagree with going down this path.
>>
>> I still think this patch is the right thing to do, but it needs to
>> come after the wider guarantee that having an active bdev reference
>> guarantees that the queue and backing_dev_info are still allocated.
>>
>> > So what I think needs to be done is that we make backing_dev_info an
>> > independently allocated structure with different lifetime rules from
>> > the gendisk or request_queue - definitely I want it to live as long
>> > as the block device inode exists. However, it needs more thought as
>> > to what the exact lifetime rules will be.
>>
>> Hmm, why does it need to be separately allocated?
>>
>> Something like this, passes the libnvdimm unit tests: (non-whitespace
>> damaged version attached)
>
> So the problem with this approach is that the request queue will be
> pinned while the bdev inode exists. And how long that is, is impossible
> to predict or influence from userspace, so e.g. you cannot remove a
> device driver from memory, and even unplugging a USB device after it has
> been unmounted would suddenly go via the "device removed while it is
> used" path, which can have unexpected consequences. I guess Jens or
> Christoph will know more about the details...
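To recap, the fix under debate is essentially a one-liner; sketching it
from the description quoted above (the posted patch itself is not inlined
here):

/* include/linux/blkdev.h - sketch of the proposed change */
static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
	/*
	 * was: return bdev->bd_disk->queue;
	 *
	 * ->bd_disk is cleared at the final __blkdev_put(), so chasing it
	 * NULL-derefs; ->bd_queue stays valid until the disk's final put.
	 */
	return bdev->bd_queue;
}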
We do have the "block, fs: reliably communicate bdev end-of-life" effort
that I need to revisit:

http://www.spinics.net/lists/linux-fsdevel/msg93312.html

...but I don't immediately see how keeping the request_queue around longer
makes the situation worse? (A rough sketch of the pinning idea is below.)

> I have prototyped patches which split backing_dev_info from request_queue
> and it was not even that difficult in the end. I'll give those patches
> some testing and post them for comments...

As long as the crashes are addressed I don't much care which solution goes
upstream. That said, I think it is unfortunate that we introduced
->bd_queue in v3.3, yet most users are still pointer-chasing through
->bd_disk->queue. I'll see if the 0day robot can find any performance
benefit to this change outside of trying to fix a bdev-unplug crash.
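On the pinning point: keeping the queue around for the life of the bdev
inode would amount to something like the sketch below. The helper names
are made up for illustration, and the actual patch (attached upthread) may
differ; only blk_get_queue()/blk_put_queue() are the existing queue
refcounting interfaces.

/* Sketch: pin the request_queue for the lifetime of the bdev inode. */
static int bd_pin_queue(struct block_device *bdev)
{
	/*
	 * blk_get_queue() fails once blk_cleanup_queue() has marked the
	 * queue dying, so a dead device cannot be re-pinned.
	 */
	if (!blk_get_queue(bdev->bd_queue))
		return -ENXIO;
	return 0;
}

static void bd_unpin_queue(struct block_device *bdev)
{
	blk_put_queue(bdev->bd_queue);	/* the final put frees the queue */
}

This is exactly what Jan objects to above: the unpin only happens when the
bdev inode goes away, and userspace cannot predict or force that.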
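For completeness, the split Jan describes would presumably look something
like the following - again an illustrative sketch with made-up names, not
his actual prototype:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

/*
 * Sketch: backing_dev_info gets its own allocation and refcount so it can
 * outlive the request_queue and gendisk.  All names are illustrative.
 */
struct backing_dev_info {
	struct kref refcnt;	/* independent lifetime */
	/* ... writeback state, congestion flags, etc. ... */
};

static void bdi_release(struct kref *kref)
{
	kfree(container_of(kref, struct backing_dev_info, refcnt));
}

static struct backing_dev_info *bdi_alloc(gfp_t gfp)
{
	struct backing_dev_info *bdi = kzalloc(sizeof(*bdi), gfp);

	if (bdi)
		kref_init(&bdi->refcnt);
	return bdi;
}

static inline void bdi_get(struct backing_dev_info *bdi)
{
	kref_get(&bdi->refcnt);
}

static inline void bdi_put(struct backing_dev_info *bdi)
{
	kref_put(&bdi->refcnt, bdi_release);
}

/*
 * request_queue would then hold a "struct backing_dev_info *" instead of
 * an embedded struct, and the block device inode would take its own
 * reference, keeping inode_to_bdi() valid after del_gendisk()/put_disk().
 */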