Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes: > Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes: > >> From: Zhang Yi <yi.zhang@xxxxxxxxxx> >> >> Now we lookup extent status entry without holding the i_data_sem before >> inserting delalloc block, it works fine in buffered write path and >> because it holds i_rwsem and folio lock, and the mmap path holds folio >> lock, so the found extent locklessly couldn't be modified concurrently. >> But it could be raced by fallocate since it allocate block whitout >> holding i_rwsem and folio lock. >> >> ext4_page_mkwrite() ext4_fallocate() >> block_page_mkwrite() >> ext4_da_map_blocks() >> //find hole in extent status tree >> ext4_alloc_file_blocks() >> ext4_map_blocks() >> //allocate block and unwritten extent >> ext4_insert_delayed_block() >> ext4_da_reserve_space() >> //reserve one more block >> ext4_es_insert_delayed_block() >> //drop unwritten extent and add delayed extent by mistake >> >> Then, the delalloc extent is wrong until writeback, the one more >> reserved block can't be release any more and trigger below warning: >> >> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! >> >> Hold i_data_sem in write mode directly can fix the problem, but it's >> expansive, we should keep the lockless check and check the extent again >> once we need to add an new delalloc block. > > Hi Zhang, > > It's a nice finding. I was wondering if this was caught in any of the > xfstests? > > I have reworded some of the commit message, feel free to use it if you > think this version is better. The use of which path uses which locks was > a bit confusing in the original commit message. > > <reworded from your original commit msg> > > ext4_da_map_blocks(), first looks up the extent status tree for any > extent entry with i_data_sem held in read mode. It then unlocks > i_data_sem, if it can't find an entry and take this lock in write > mode for inserting a new da entry. Sorry about this above paragraph. I messed this paragraph. Here is the correct version of this. ext4_da_map_blocks looks up for any extent entry in the extent status tree (w/o i_data_sem) and then the looks up for any ondisk extent mapping (with i_data_sem in read mode). If it finds a hole in the extent status tree or if it couldn't find any entry at all, it then takes the i_data_sem in write mode to add a da entry into the extent status tree. This can actually race with page mkwrite & fallocate path. Note that this is ok between... <and the rest can remain same> > > This is ok between - > 1. ext4 buffered-write path v/s ext4_page_mkwrite(), because of the > folio lock > 2. ext4 buffered write path v/s ext4 fallocate because of the inode > lock. > > But this can race between ext4_page_mkwrite() & ext4 fallocate path - > > ext4_page_mkwrite() ext4_fallocate() > block_page_mkwrite() > ext4_da_map_blocks() > //find hole in extent status tree > ext4_alloc_file_blocks() > ext4_map_blocks() > //allocate block and unwritten extent > ext4_insert_delayed_block() > ext4_da_reserve_space() > //reserve one more block > ext4_es_insert_delayed_block() > //drop unwritten extent and add delayed extent by mistake > > Then, the delalloc extent is wrong until writeback and the extra > reserved block can't be released any more and it triggers below warning: > > EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! > > This patch fixes the problem by looking up extent status tree again > while the i_data_sem is held in write mode. If it still can't find > any entry, then we insert a new da entry into the extent status tree. > >> >> Cc: stable@xxxxxxxxxxxxxxx >> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> >> --- >> fs/ext4/inode.c | 19 +++++++++++++++++++ >> 1 file changed, 19 insertions(+) >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 6a41172c06e1..118b0497a954 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -1737,6 +1737,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock, >> if (ext4_es_is_hole(&es)) >> goto add_delayed; >> >> +found: >> /* >> * Delayed extent could be allocated by fallocate. >> * So we need to check it. >> @@ -1781,6 +1782,24 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock, >> >> add_delayed: >> down_write(&EXT4_I(inode)->i_data_sem); >> + /* >> + * Lookup extents tree again under i_data_sem, make sure this >> + * inserting delalloc range haven't been delayed or allocated >> + * whitout holding i_rwsem and folio lock. >> + */ > > page fault path (ext4_page_mkwrite does not take i_rwsem) and fallocate > path (no folio lock) can race. Make sure we lookup the extent status > tree here again while i_data_sem is held in write mode, before inserting > a new da entry in the extent status tree. > > -ritesh