On Thu, 28 Jun 2012, Jan Kara wrote:

> On Wed 27-06-12 23:04:09, Mikulas Patocka wrote:
> > The kernel crashes when IO is being submitted to a block device and block
> > size of that device is changed simultaneously.
>   Nasty ;-)
>
> > To reproduce the crash, apply this patch:
> >
> > --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
> > +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
> > @@ -28,6 +28,7 @@
> >  #include <linux/log2.h>
> >  #include <linux/cleancache.h>
> >  #include <asm/uaccess.h>
> > +#include <linux/delay.h>
> >  #include "internal.h"
> >  
> >  struct bdev_inode {
> > @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
> >  
> >  	bh->b_bdev = I_BDEV(inode);
> >  	bh->b_blocknr = iblock;
> > +	msleep(1000);
> >  	bh->b_size = max_blocks << inode->i_blkbits;
> >  	if (max_blocks)
> >  		set_buffer_mapped(bh);
> >
> > Use some device with 4k blocksize, for example a ramdisk.
> > Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
> > While it is sleeping in the msleep function, run "blockdev --setbsz 2048
> > /dev/ram0" on the other console.
> > You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
> >
> >
> > One may ask "why would anyone do this - submit I/O and change block size
> > simultaneously?" - the problem is that udev and lvm can scan and read all
> > block devices anytime - so anytime you change block device size, there may
> > be some i/o to that device in flight and the crash may happen. That BUG
> > actually happened in production environment because of lvm scanning block
> > devices and some other software changing block size at the same time.
> >
>   Yeah, it's nasty and neither solution looks particularly appealing. One
> idea that came to my mind is: I'm trying to solve some races between direct
> IO, buffered IO, hole punching etc. by a new mapping interval lock.
> I'm not sure if it will go anywhere yet but if it does, we can fix the above
> race by taking the mapping lock for the whole block device around setting
> block size thus effectively disallowing any IO to it.
>
> 								Honza
> --
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR
>

Hi

This is the patch that fixes this crash: it takes a rw-semaphore around the
whole direct-IO path.

(note that if someone is concerned about performance, the rw-semaphore
could be made per-cpu --- take it for read on the current CPU and take it
for write on all CPUs).

Mikulas

---

blockdev: fix a crash when block size is changed and I/O is issued simultaneously

The kernel may crash when block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device anytime,
the bug actually puts any code that changes a device's block size in
jeopardy.

The crash can be reproduced if you place "msleep(1000)" in
blkdev_get_blocks just before "bh->b_size = max_blocks << inode->i_blkbits;".
Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0"
You get a BUG.

The direct and non-direct I/O code is written with the assumption that
block size does not change. It doesn't seem practical to fix these crashes
one-by-one: there may be many crash possibilities when block size changes
at a certain place and it is impossible to find them all and verify the
code.

This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.

For asynchronous I/O, the patch only prevents block size change while
the I/O is being submitted. The block size can change when the I/O is in
progress or when the I/O is being finished. This is acceptable because
there are no accesses to block size when asynchronous I/O is being
finished.

The patch prevents block size changing while the device is mapped with
mmap.

Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>

---
 drivers/char/raw.c |    2 -
 fs/block_dev.c     |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |    4 +++
 3 files changed, 61 insertions(+), 3 deletions(-)

Index: linux-3.5-rc6-devel/include/linux/fs.h
===================================================================
--- linux-3.5-rc6-devel.orig/include/linux/fs.h	2012-07-16 01:18:45.000000000 +0200
+++ linux-3.5-rc6-devel/include/linux/fs.h	2012-07-16 01:29:21.000000000 +0200
@@ -713,6 +713,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct rw_semaphore	bd_block_size_semaphore;
 };
 
 /*
@@ -2414,6 +2416,8 @@ extern int generic_segment_checks(const
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
Index: linux-3.5-rc6-devel/fs/block_dev.c
===================================================================
--- linux-3.5-rc6-devel.orig/fs/block_dev.c	2012-07-16 01:14:33.000000000 +0200
+++ linux-3.5-rc6-devel/fs/block_dev.c	2012-07-16 01:37:28.000000000 +0200
@@ -124,6 +124,20 @@ int set_blocksize(struct block_device *b
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Prevent starting I/O or mapping the device */
+	down_write(&bdev->bd_block_size_semaphore);
+
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		up_write(&bdev->bd_block_size_semaphore);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +145,9 @@ int set_blocksize(struct block_device *b
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
+	up_write(&bdev->bd_block_size_semaphore);
+
 	return 0;
 }
 
@@ -472,6 +489,7 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
+	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1567,6 +1585,22 @@ static long block_ioctl(struct file *fil
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,10 +1612,13 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	down_read(&bdev->bd_block_size_semaphore);
+
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
 		ssize_t err;
@@ -1590,10 +1627,27 @@ ssize_t blkdev_aio_write(struct kiocb *i
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+
+	up_read(&bdev->bd_block_size_semaphore);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_mmap(file, vma);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1624,9 +1678,9 @@ const struct file_operations def_blk_fop
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
Index: linux-3.5-rc6-devel/drivers/char/raw.c
===================================================================
--- linux-3.5-rc6-devel.orig/drivers/char/raw.c	2012-07-16 01:29:27.000000000 +0200
+++ linux-3.5-rc6-devel/drivers/char/raw.c	2012-07-16 01:30:04.000000000 +0200
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,
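
For readers who want to see the locking rule from the patch description in isolation (read lock while I/O is submitted, write lock while the block size changes), here is a minimal user-space sketch. It is not kernel code and not part of the patch: it assumes POSIX threads, and struct fake_bdev, fake_bdev_init, fake_submit_io and fake_set_blocksize are hypothetical names invented for illustration. pthread_rwlock_t stands in for the kernel rw_semaphore.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct block_device; not kernel code. */
struct fake_bdev {
	pthread_rwlock_t block_size_lock;	/* plays the role of bd_block_size_semaphore */
	unsigned int block_size;		/* plays the role of bd_block_size */
};

/* Returns 0 on success or a positive pthread error number. */
int fake_bdev_init(struct fake_bdev *bdev, unsigned int block_size)
{
	bdev->block_size = block_size;
	return pthread_rwlock_init(&bdev->block_size_lock, NULL);
}

/*
 * I/O submission path: hold the lock for read, so many submitters can run
 * in parallel while the block size stays fixed for the whole call.
 * Returns how many whole blocks the request covers.
 */
size_t fake_submit_io(struct fake_bdev *bdev, size_t bytes)
{
	size_t blocks;

	pthread_rwlock_rdlock(&bdev->block_size_lock);
	blocks = bytes / bdev->block_size;
	pthread_rwlock_unlock(&bdev->block_size_lock);

	return blocks;
}

/*
 * Block size change: hold the lock for write, which waits until no
 * submitter is inside fake_submit_io() and keeps new ones out.
 */
int fake_set_blocksize(struct fake_bdev *bdev, unsigned int size)
{
	/* Reject zero and sizes that are not a power of two. */
	if (size == 0 || (size & (size - 1)) != 0)
		return -EINVAL;

	pthread_rwlock_wrlock(&bdev->block_size_lock);
	bdev->block_size = size;
	pthread_rwlock_unlock(&bdev->block_size_lock);

	return 0;
}
```

With a 4096-byte block size, fake_submit_io(&bdev, 8192) covers 2 blocks; after fake_set_blocksize(&bdev, 2048) succeeds, the same request covers 4. A submitter never sees the size change in the middle of a call, which is the property the patch relies on. The sketch leaves out the mmap check and the asynchronous completion case discussed in the description.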