Mikulas Patocka <mpatocka@xxxxxxxxxx> writes: > On Thu, 28 Jun 2012, Jan Kara wrote: > >> On Wed 27-06-12 23:04:09, Mikulas Patocka wrote: >> > The kernel crashes when IO is being submitted to a block device and block >> > size of that device is changed simultaneously. >> Nasty ;-) >> >> > To reproduce the crash, apply this patch: >> > >> > --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200 >> > +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200 >> > @@ -28,6 +28,7 @@ >> > #include <linux/log2.h> >> > #include <linux/cleancache.h> >> > #include <asm/uaccess.h> >> > +#include <linux/delay.h> >> > #include "internal.h" >> > struct bdev_inode { >> > @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s >> > >> > bh->b_bdev = I_BDEV(inode); >> > bh->b_blocknr = iblock; >> > + msleep(1000); >> > bh->b_size = max_blocks << inode->i_blkbits; >> > if (max_blocks) >> > set_buffer_mapped(bh); >> > >> > Use some device with 4k blocksize, for example a ramdisk. >> > Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct" >> > While it is sleeping in the msleep function, run "blockdev --setbsz 2048 >> > /dev/ram0" on the other console. >> > You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0); >> > >> > >> > One may ask "why would anyone do this - submit I/O and change block size >> > simultaneously?" - the problem is that udev and lvm can scan and read all >> > block devices anytime - so anytime you change block device size, there may >> > be some i/o to that device in flight and the crash may happen. That BUG >> > actually happened in production environment because of lvm scanning block >> > devices and some other software changing block size at the same time. >> > >> Yeah, it's nasty and neither solution looks particularly appealing. One >> idea that came to my mind is: I'm trying to solve some races between direct >> IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not >> sure if it will go anywhere yet but if it does, we can fix the above race >> by taking the mapping lock for the whole block device around setting block >> size thus effectivelly disallowing any IO to it. >> >> Honza >> -- >> Jan Kara <jack@xxxxxxx> >> SUSE Labs, CR >> > > Hi > > This is the patch that fixes this crash: it takes a rw-semaphore around > all direct-IO path. > > (note that if someone is concerned about performance, the rw-semaphore > could be made per-cpu --- take it for read on the current CPU and take it > for write on all CPUs). Here we go again. :-) I believe we had at one point tried taking a rw semaphore around GUP inside of the direct I/O code path to fix the fork vs. GUP race (that still exists today). When testing that, the overhead of the semaphore was *way* too high to be considered an acceptable solution. I've CC'd Larry Woodman, Andrea, and Kosaki Motohiro who all worked on that particular bug. Hopefully they can give better quantification of the slowdown than my poor memory. Cheers, Jeff > blockdev: fix a crash when block size is changed and I/O is issued simultaneously > > The kernel may crash when block size is changed and I/O is issued > simultaneously. > > Because some subsystems (udev or lvm) may read any block device anytime, > the bug actually puts any code that changes a block device size in > jeopardy. > > The crash can be reproduced if you place "msleep(1000)" to > blkdev_get_blocks just before "bh->b_size = max_blocks << > inode->i_blkbits;". > Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct" > While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0" > You get a BUG. > > The direct and non-direct I/O is written with the assumption that block > size does not change. It doesn't seem practical to fix these crashes > one-by-one there may be many crash possibilities when block size changes > at a certain place and it is impossible to find them all and verify the > code. > > This patch introduces a new rw-lock bd_block_size_semaphore. The lock is > taken for read during I/O. It is taken for write when changing block > size. Consequently, block size can't be changed while I/O is being > submitted. > > For asynchronous I/O, the patch only prevents block size change while > the I/O is being submitted. The block size can change when the I/O is in > progress or when the I/O is being finished. This is acceptable because > there are no accesses to block size when asynchronous I/O is being > finished. > > The patch prevents block size changing while the device is mapped with > mmap. > > Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx> > > --- > drivers/char/raw.c | 2 - > fs/block_dev.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-- > include/linux/fs.h | 4 +++ > 3 files changed, 61 insertions(+), 3 deletions(-) > > Index: linux-3.5-rc6-devel/include/linux/fs.h > =================================================================== > --- linux-3.5-rc6-devel.orig/include/linux/fs.h 2012-07-16 01:18:45.000000000 +0200 > +++ linux-3.5-rc6-devel/include/linux/fs.h 2012-07-16 01:29:21.000000000 +0200 > @@ -713,6 +713,8 @@ struct block_device { > int bd_fsfreeze_count; > /* Mutex for freeze */ > struct mutex bd_fsfreeze_mutex; > + /* A semaphore that prevents I/O while block size is being changed */ > + struct rw_semaphore bd_block_size_semaphore; > }; > > /* > @@ -2414,6 +2416,8 @@ extern int generic_segment_checks(const > unsigned long *nr_segs, size_t *count, int access_flags); > > /* fs/block_dev.c */ > +extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov, > + unsigned long nr_segs, loff_t pos); > extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov, > unsigned long nr_segs, loff_t pos); > extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end, > Index: linux-3.5-rc6-devel/fs/block_dev.c > =================================================================== > --- linux-3.5-rc6-devel.orig/fs/block_dev.c 2012-07-16 01:14:33.000000000 +0200 > +++ linux-3.5-rc6-devel/fs/block_dev.c 2012-07-16 01:37:28.000000000 +0200 > @@ -124,6 +124,20 @@ int set_blocksize(struct block_device *b > if (size < bdev_logical_block_size(bdev)) > return -EINVAL; > > + /* Prevent starting I/O or mapping the device */ > + down_write(&bdev->bd_block_size_semaphore); > + > + /* Check that the block device is not memory mapped */ > + mapping = bdev->bd_inode->i_mapping; > + mutex_lock(&mapping->i_mmap_mutex); > + if (!prio_tree_empty(&mapping->i_mmap) || > + !list_empty(&mapping->i_mmap_nonlinear)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + up_write(&bdev->bd_block_size_semaphore); > + return -EBUSY; > + } > + mutex_unlock(&mapping->i_mmap_mutex); > + > /* Don't change the size if it is same as current */ > if (bdev->bd_block_size != size) { > sync_blockdev(bdev); > @@ -131,6 +145,9 @@ int set_blocksize(struct block_device *b > bdev->bd_inode->i_blkbits = blksize_bits(size); > kill_bdev(bdev); > } > + > + up_write(&bdev->bd_block_size_semaphore); > + > return 0; > } > > @@ -472,6 +489,7 @@ static void init_once(void *foo) > inode_init_once(&ei->vfs_inode); > /* Initialize mutex for freeze. */ > mutex_init(&bdev->bd_fsfreeze_mutex); > + init_rwsem(&bdev->bd_block_size_semaphore); > } > > static inline void __bd_forget(struct inode *inode) > @@ -1567,6 +1585,22 @@ static long block_ioctl(struct file *fil > return blkdev_ioctl(bdev, mode, cmd, arg); > } > > +ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov, > + unsigned long nr_segs, loff_t pos) > +{ > + ssize_t ret; > + struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); > + > + down_read(&bdev->bd_block_size_semaphore); > + > + ret = generic_file_aio_read(iocb, iov, nr_segs, pos); > + > + up_read(&bdev->bd_block_size_semaphore); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(blkdev_aio_read); > + > /* > * Write data to the block device. Only intended for the block device itself > * and the raw driver which basically is a fake block device. > @@ -1578,10 +1612,13 @@ ssize_t blkdev_aio_write(struct kiocb *i > unsigned long nr_segs, loff_t pos) > { > struct file *file = iocb->ki_filp; > + struct block_device *bdev = I_BDEV(file->f_mapping->host); > ssize_t ret; > > BUG_ON(iocb->ki_pos != pos); > > + down_read(&bdev->bd_block_size_semaphore); > + > ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); > if (ret > 0 || ret == -EIOCBQUEUED) { > ssize_t err; > @@ -1590,10 +1627,27 @@ ssize_t blkdev_aio_write(struct kiocb *i > if (err < 0 && ret > 0) > ret = err; > } > + > + up_read(&bdev->bd_block_size_semaphore); > + > return ret; > } > EXPORT_SYMBOL_GPL(blkdev_aio_write); > > +int blkdev_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + int ret; > + struct block_device *bdev = I_BDEV(file->f_mapping->host); > + > + down_read(&bdev->bd_block_size_semaphore); > + > + ret = generic_file_mmap(file, vma); > + > + up_read(&bdev->bd_block_size_semaphore); > + > + return ret; > +} > + > /* > * Try to release a page associated with block device when the system > * is under memory pressure. > @@ -1624,9 +1678,9 @@ const struct file_operations def_blk_fop > .llseek = block_llseek, > .read = do_sync_read, > .write = do_sync_write, > - .aio_read = generic_file_aio_read, > + .aio_read = blkdev_aio_read, > .aio_write = blkdev_aio_write, > - .mmap = generic_file_mmap, > + .mmap = blkdev_mmap, > .fsync = blkdev_fsync, > .unlocked_ioctl = block_ioctl, > #ifdef CONFIG_COMPAT > Index: linux-3.5-rc6-devel/drivers/char/raw.c > =================================================================== > --- linux-3.5-rc6-devel.orig/drivers/char/raw.c 2012-07-16 01:29:27.000000000 +0200 > +++ linux-3.5-rc6-devel/drivers/char/raw.c 2012-07-16 01:30:04.000000000 +0200 > @@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct > > static const struct file_operations raw_fops = { > .read = do_sync_read, > - .aio_read = generic_file_aio_read, > + .aio_read = blkdev_aio_read, > .write = do_sync_write, > .aio_write = blkdev_aio_write, > .fsync = blkdev_fsync, > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html