On Thu, 2021-01-07 at 17:48 +0200, Maxim Levitsky wrote: > On Thu, 2021-01-07 at 16:40 +0100, Jan Kara wrote: > > blkdev_fallocate() tries to detect whether a discard raced with an > > overlapping write by calling invalidate_inode_pages2_range(). However > > this check can give both false negatives (when writing using direct IO > > or when writeback already writes out the written pagecache range) and > > false positives (when write is not actually overlapping but ends in the > > same page when blocksize < pagesize). This actually causes issues for > > qemu which is getting confused by EBUSY errors. > > > > Fix the problem by removing this conflicting write detection since it is > > inherently racy and thus of little use anyway. > > > > Reported-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx> > > CC: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> > > Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@xxxxxxxxxx > > Signed-off-by: Jan Kara <jack@xxxxxxx> > > --- > > fs/block_dev.c | 10 ++++------ > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 3e5b02f6606c..a97f43b49839 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, > > return error; > > > > /* > > - * Invalidate again; if someone wandered in and dirtied a page, > > - * the caller will be given -EBUSY. The third argument is > > - * inclusive, so the rounding here is safe. > > + * Invalidate the page cache again; if someone wandered in and dirtied > > + * a page, we just discard it - userspace has no way of knowing whether > > + * the write happened before or after discard completing... > > */ > > - return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping, > > - start >> PAGE_SHIFT, > > - end >> PAGE_SHIFT); > > + return truncate_bdev_range(bdev, file->f_mode, start, end); > > } > > But what happens if write and discard don't overlap? Won't we > discard the written data in this case? Ah, I see, the truncate_bdev_range preserves the partial areas that are not included in the range. In this case this indeed looks right. Reviewed-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx> Best regards, Maxim Levitsky > > > Best regards, > Maxim Levitsky > > > > > > const struct file_operations def_blk_fops = {