On Thu 25-04-24 18:58:48, Ritesh Harjani (IBM) wrote:
> There is a possibility of the following race with iomap during
> writeback -
>
> write_cache_pages()
>   cache extent covering 0..1MB range
>   write page at offset 0k
> 			truncate(file, 4k)
> 			  drops all relevant pages
> 			  frees fs blocks
> 			pwrite(file, 4k, 4k)
> 			  creates dirty page in the page cache
>   writes page at offset 4k to a stale block
>
> This race can happen because iomap_writepages() keeps a cached extent mapping
> within struct iomap. While write_cache_pages() is going over each folio
> (it can cache a large extent range), if a truncate happens in parallel on the
> next folio followed by a buffered write to the same offset within the file,
> this can change the logical to physical offset of the cached iomap mapping.
> That means the cached iomap has now become stale.
>
> This patch implements the seq counter approach for revalidation of stale
> iomap mappings. i_blkseq will get incremented for every block
> allocation/free. Here is what we do -
>
> For ext2 buffered-writes, the block allocation happens at
> ->write_iter time itself. So at writeback time,
> 1. We first cache the i_blkseq.
> 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
>    already allocated.
> 3. Call ext2_get_blocks() the second time with length to be the same as
>    the no. of blocks we know were already allocated.
> 4. Till now it means the cached i_blkseq remains valid as no block
>    allocation has happened yet.
>    This means on the next call to ->map_blocks(), we can verify whether
>    i_blkseq has raced with truncate or not. If not, then i_blkseq will
>    remain valid.
>
> In case of a hole (could happen with mmaped writes), we only allocate
> 1 block at a time anyway. So even if the i_blkseq value changes right
> after, we anyway need to allocate the next block in the subsequent
> ->map_blocks() call.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>

A few small comments below.
> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
>  	return container_of(inode, struct ext2_inode_info, vfs_inode);
>  }
>  
> +static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
> +{
> +	WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
> +}
> +

Please add a comment here (and an assertion as well) that updates of
i_blkseq are protected by ei->i_truncate_mutex. Reads can race at any
moment, so that's the reason why WRITE_ONCE() is used. You can remove
READ_ONCE() here as it is pointless (we are locked here).

>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>  				 struct inode *inode, loff_t offset,
>  				 unsigned len)
>  {
> -	if (offset >= wpc->iomap.offset &&
> -	    offset < wpc->iomap.offset + wpc->iomap.length)
> +	loff_t maxblocks = (loff_t)INT_MAX;
> +	u8 blkbits = inode->i_blkbits;
> +	u32 bno;
> +	bool new, boundary;
> +	int ret;
> +
> +	if (ext2_imap_valid(wpc, inode, offset))
>  		return 0;
>  
> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +	/*
> +	 * For ext2 buffered-writes, the block allocation happens at the
> +	 * ->write_iter time itself. So at writeback time -
> +	 * 1. We first cache the i_blkseq.
> +	 * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> +	 *    already allocated.
> +	 * 3. Call ext2_get_blocks() the second time with length to be same as
> +	 *    the no. of blocks we know were already allocated.
> +	 * 4. Till now it means, the cached i_blkseq remains valid as no block
> +	 *    allocation has happened yet.
> +	 * This means the next call to ->map_blocks(), we can verify whether the
> +	 * i_blkseq has raced with truncate or not. If not, then i_blkseq will
> +	 * remain valid.
> +	 *
> +	 * In case of a hole (could happen with mmaped writes), we only allocate
> +	 * 1 block at a time anyways. So even if the i_blkseq value changes, we
> +	 * anyway need to allocate the next block in subsequent ->map_blocks()
> +	 * call.
> +	 */

I suspect it would be tidier to move this logic into ext2_get_blocks()
itself but I guess it is ok as is for now.

> +	wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
> +
> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> +			      &bno, &new, &boundary, 0);
> +	if (ret < 0)
> +		return ret;
> +	/*
> +	 * ret can be 0 in case of a hole which is possible for mmaped writes.
> +	 */
> +	ret = ret ? ret : 1;
> +	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
>  				IOMAP_WRITE, &wpc->iomap, NULL);
>  }
>  
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 37f7ce56adce..32f5386284d6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
>  #ifdef CONFIG_QUOTA
>  	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
>  #endif
> -
> +	WRITE_ONCE(ei->i_blkseq, 0);
>  	return &ei->vfs_inode;

No need for WRITE_ONCE() here. This cannot race with anything...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR