On 2024/4/27 0:39, Ritesh Harjani (IBM) wrote: > Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes: > >> On 2024/4/26 20:57, Ritesh Harjani (IBM) wrote: >>> Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes: >>> >>>> Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes: >>>> >>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx> >>>>> >>>>> Now we lookup extent status entry without holding the i_data_sem before >>>>> inserting delalloc block, it works fine in buffered write path and >>>>> because it holds i_rwsem and folio lock, and the mmap path holds folio >>>>> lock, so the found extent locklessly couldn't be modified concurrently. >>>>> But it could be raced by fallocate since it allocate block whitout >>>>> holding i_rwsem and folio lock. >>>>> >>>>> ext4_page_mkwrite() ext4_fallocate() >>>>> block_page_mkwrite() >>>>> ext4_da_map_blocks() >>>>> //find hole in extent status tree >>>>> ext4_alloc_file_blocks() >>>>> ext4_map_blocks() >>>>> //allocate block and unwritten extent >>>>> ext4_insert_delayed_block() >>>>> ext4_da_reserve_space() >>>>> //reserve one more block >>>>> ext4_es_insert_delayed_block() >>>>> //drop unwritten extent and add delayed extent by mistake >>>>> >>>>> Then, the delalloc extent is wrong until writeback, the one more >>>>> reserved block can't be release any more and trigger below warning: >>>>> >>>>> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! >>>>> >>>>> Hold i_data_sem in write mode directly can fix the problem, but it's >>>>> expansive, we should keep the lockless check and check the extent again >>>>> once we need to add an new delalloc block. >>>> >>>> Hi Zhang, >>>> >>>> It's a nice finding. I was wondering if this was caught in any of the >>>> xfstests? >>>> >> >> Hi, Ritesh >> >> I caught this issue when I tested my iomap series in generic/344 and >> generic/346. It's easy to reproduce because the iomap's buffered write path >> doesn't hold folio lock while inserting delalloc blocks, so it could be raced >> by the mmap page fault path. But the buffer_head's buffered write path can't >> trigger this problem, > > ya right! That's the difference between how ->map_blocks() is called > between buffer_head v/s iomap path. In iomap the ->map_blocks() call > happens first to map a large extent and then it iterate over all the > locked folios covering the mapped extent for doing writes. > Whereas in buffer_head while iterating, we first instantiate/lock the > folio and then call ->map_blocks() to map an extent for the given folio. > > ... So this opens up this window for a race between iomap buffered write > path v/s page mkwrite path for inserting delalloc blocks entries. > >> the race between buffered write path and fallocate path >> was discovered while I was analyzing the code, so I'm not sure if it could >> be caught by xfstests now, at least I haven't noticed this problem so far. >> > > Did you mean the race between page fault path and fallocate path here? > Because buffered write path and fallocate path should not have any race > since both takes the inode_lock. I guess you meant page fault path and > fallocate path for which you wrote this patch too :) Yep. > > I am surprised, why we cannot see the this race between page mkwrite and > fallocate in fstests for inserting da entries to extent status cache. > Because the race you identified looks like a legitimate race and is > mostly happening since ext4_da_map_blocks() was not doing the right > thing. > ... looking at the src/holetest, it doesn't really excercise this path. > So maybe we can writing such fstest to trigger this race. > I guess the stress tests and smoke tests in fstests have caught it, e.g. generic/476. Since there is only one error message in ext4_destroy_inode() when the race issue happened, we can't detect it unless we go and check the logs manually. I suppose we need to add more warnings, something like this, how does it sound? diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c8b691e605f1..4b6fd9b63b12 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1255,6 +1255,8 @@ static void ext4_percpu_param_destroy(struct ext4_sb_info *sbi) percpu_counter_destroy(&sbi->s_freeclusters_counter); percpu_counter_destroy(&sbi->s_freeinodes_counter); percpu_counter_destroy(&sbi->s_dirs_counter); + WARN_ON_ONCE(!ext4_forced_shutdown(sbi->s_sb) && + percpu_counter_sum(&sbi->s_dirtyclusters_counter)); percpu_counter_destroy(&sbi->s_dirtyclusters_counter); percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit); percpu_free_rwsem(&sbi->s_writepages_rwsem); @@ -1476,7 +1478,8 @@ static void ext4_destroy_inode(struct inode *inode) dump_stack(); } - if (EXT4_I(inode)->i_reserved_data_blocks) + if (!ext4_forced_shutdown(inode->i_sb) && + WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks)) ext4_msg(inode->i_sb, KERN_ERR, "Inode %lu (%p): i_reserved_data_blocks (%u) not cleared!", inode->i_ino, EXT4_I(inode), Thanks, Yi. > >>>> I have reworded some of the commit message, feel free to use it if you >>>> think this version is better. The use of which path uses which locks was >>>> a bit confusing in the original commit message. >>>> >> >> Thanks for the message improvement, it looks more clear then mine, I will >> use it. >> > > Glad, it was helpful. > > -ritesh >