Hi Jan, On 2017/3/13 23:14, Jan Kara wrote: > blkdev_open() may race with gendisk shutdown in two different ways. > Either del_gendisk() has already unhashed block device inode (and thus > bd_acquire() will end up creating new block device inode) however > gen_gendisk() will still return the gendisk that is being destroyed. > Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed > before we get to get_gendisk() and get_gendisk() will return new gendisk > that got allocated for device that reused the device number. > > In both cases this will result in possible inconsistencies between > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets > destroyed and device number reused, in the second case immediately). > Fix the problem by checking whether the gendisk is still alive and inode > hashed when associating bdev inode with it and its bdi. That way we are > sure that we will not associate bdev inode with disk that got past > blk_unregister_region() in del_gendisk() (and thus device number can get > reused). Similarly, we will not associate bdev that was once associated > with gendisk that is going away (and thus the corresponding bdev inode > will get unhashed in del_gendisk()) with a new gendisk that is just > reusing the device number. I have run some extended tests based on Omar Sandoval's script [1] these days, and found two new problems related with the race between bdev open and gendisk shutdown. The first problem lies in __blkdev_get(), and will cause use-after-free, or worse oops. It happens because there may be two instances of gendisk related with one bdev. When the process which owns the newer gendisk completes the invocation of __blkdev_get() firstly, the other process which owns the older gendisk will put the last reference of the older gendisk and cause use-after-free. The following call sequences illustrate the problem: unhash the bdev_inode of /dev/sda process A (blkdev_open()): bd_acquire() returns new bdev_inode of /dev/sda get_gendisk() returns gendisk (v1 gendisk) remove gendisk from bdev_map device number is reused process B (blkdev_open()): bd_acquire() returns new bdev_inode of /dev/sda get_gendisk() returns a new gendisk (v2 gendisk) increase bdev->bd_openers process A (blkdev_open()): find bdev->bd_openers != 0 put_disk(disk) // this is the last reference to v1 gendisk disk_unblock_events(disk) // use-after-free occurs The problem can be fixed by an extra check showed in the following snippet: static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) { if (!bdev->bd_openers) { } else { if (bdev->bd_disk != disk) { ret = -ENXIO; goto out_unlock_bdev; } } } And we also can simplify the check in your patch by moving blk_unregister_region() before bdev_unhash_inode(), so whenever we get a new bdev_inode, the gendisk returned by get_gendisk() will either be NULL or a new gendisk. And the only check we need to do in __blkdev_get() is checking the hash status of the bdev_inode after we get a gendisk from get_gendisk(). The check can be done like the following snippet: static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) { /* ...... */ ret = -ENXIO; disk = get_gendisk(bdev->bd_dev, &partno); if (!disk) goto out; owner = disk->fops->owner; if (inode_unhashed(bdev->bd_inode)) goto out_unmatched; } The second problem lies in bd_start_claiming(), and will trigger the BUG_ON assert in blkdev_get: BUG_ON(!bd_may_claim(bdev, whole, holder)). It occurs when testing the exclusive open and close of the disk and its partitions. The cause of the problem is that a bdev_inode of a disk partition corresponds with two instances of bdev_inode of the whole disk simultaneously. When one pair of the partition inode and disk inode completes the claiming, the other pair will be stumbled by the BUG_ON assert in blkdev_get() because bdev->bd_holder is no longer NULL. The following sequences illustrate the problem: unhash the bdev_inode of /dev/sda2 process A (blkdev_open()): bd_acquire() returns a new bdev_inode for /dev/sda2 bd_start_claiming() returns bdev_inode of /dev/sda process B (blkdev_open()): bd_acquire() returns the new bdev_inode for /dev/sda2 unhash the bdev_inode of /dev/sda remove gendisk from bdev_map device number is reused process B (blkdev_open()): bd_start_claming() returns a new bdev_inode for /dev/sda process A (blkdev_open()): __blkdev_get() returns successfully finish claiming // bdev->bd_holder = holder process B (blkdev_open()): __blkdev_get() returns successfully trigger BUG_ON(!bd_may_claim(bdev, whole, holder)) And the problem can be fixed by adding more checks in bd_start_claiming(): static struct block_device *bd_start_claiming(struct block_device *bdev, void *holder) { /* ...... */ disk = get_gendisk(bdev->bd_dev, &partno); if (!disk) return ERR_PTR(-ENXIO); if (inode_unhashed(bdev->bd_inode)) { module_put(disk->fops->owner); put_disk(disk); return ERR_PTR(-ENXIO); } /* ...... */ if (!whole) return ERR_PTR(-ENOMEM); if (inode_unhashed(bdev->bd_inode) || inode_unhashed(whole->bd_inode)) { bdput(whole); return ERR_PTR(-ENXIO); } } Except adding more and more checks, are there better solutions to the race problem ? I have thought about managing the device number by ref-counter, so the device number will not be reused until the last reference of bdev_inode and gendisk are released, and i am trying to figure out a way to get/put a reference of the device number when get/put the block_device. So any suggests and thoughts ? Regards, Tao --- 1: https://marc.info/?l=linux-block&m=148554717109098&w=2 > Also add a warning that will tell us about unexpected inconsistencies > between bdi associated with the bdev inode and bdi associated with the > disk. > > Signed-off-by: Jan Kara <jack@xxxxxxx> > --- > fs/block_dev.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 53e2389ae4d4..5ec8750f5332 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) > if (!partno) { > ret = -ENXIO; > bdev->bd_part = disk_get_part(disk, partno); > - if (!bdev->bd_part) > + if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part || > + inode_unhashed(bdev->bd_inode)) > goto out_clear; > > ret = 0; > @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) > bdev->bd_contains = whole; > bdev->bd_part = disk_get_part(disk, partno); > if (!(disk->flags & GENHD_FL_UP) || > - !bdev->bd_part || !bdev->bd_part->nr_sects) { > + !bdev->bd_part || !bdev->bd_part->nr_sects || > + inode_unhashed(bdev->bd_inode)) { > ret = -ENXIO; > goto out_clear; > } > @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) > > if (bdev->bd_bdi == &noop_backing_dev_info) > bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info); > + else > + WARN_ON_ONCE(bdev->bd_bdi != > + disk->queue->backing_dev_info); > } else { > if (bdev->bd_contains == bdev) { > ret = 0; >