Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes: > On 2024/4/26 20:57, Ritesh Harjani (IBM) wrote: >> Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes: >> >>> Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes: >>> >>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx> >>>> >>>> Now we lookup extent status entry without holding the i_data_sem before >>>> inserting delalloc block, it works fine in buffered write path and >>>> because it holds i_rwsem and folio lock, and the mmap path holds folio >>>> lock, so the found extent locklessly couldn't be modified concurrently. >>>> But it could be raced by fallocate since it allocate block whitout >>>> holding i_rwsem and folio lock. >>>> >>>> ext4_page_mkwrite() ext4_fallocate() >>>> block_page_mkwrite() >>>> ext4_da_map_blocks() >>>> //find hole in extent status tree >>>> ext4_alloc_file_blocks() >>>> ext4_map_blocks() >>>> //allocate block and unwritten extent >>>> ext4_insert_delayed_block() >>>> ext4_da_reserve_space() >>>> //reserve one more block >>>> ext4_es_insert_delayed_block() >>>> //drop unwritten extent and add delayed extent by mistake >>>> >>>> Then, the delalloc extent is wrong until writeback, the one more >>>> reserved block can't be release any more and trigger below warning: >>>> >>>> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! >>>> >>>> Hold i_data_sem in write mode directly can fix the problem, but it's >>>> expansive, we should keep the lockless check and check the extent again >>>> once we need to add an new delalloc block. >>> >>> Hi Zhang, >>> >>> It's a nice finding. I was wondering if this was caught in any of the >>> xfstests? >>> > > Hi, Ritesh > > I caught this issue when I tested my iomap series in generic/344 and > generic/346. It's easy to reproduce because the iomap's buffered write path > doesn't hold folio lock while inserting delalloc blocks, so it could be raced > by the mmap page fault path. But the buffer_head's buffered write path can't > trigger this problem, ya right! That's the difference between how ->map_blocks() is called between buffer_head v/s iomap path. In iomap the ->map_blocks() call happens first to map a large extent and then it iterate over all the locked folios covering the mapped extent for doing writes. Whereas in buffer_head while iterating, we first instantiate/lock the folio and then call ->map_blocks() to map an extent for the given folio. ... So this opens up this window for a race between iomap buffered write path v/s page mkwrite path for inserting delalloc blocks entries. > the race between buffered write path and fallocate path > was discovered while I was analyzing the code, so I'm not sure if it could > be caught by xfstests now, at least I haven't noticed this problem so far. > Did you mean the race between page fault path and fallocate path here? Because buffered write path and fallocate path should not have any race since both takes the inode_lock. I guess you meant page fault path and fallocate path for which you wrote this patch too :) I am surprised, why we cannot see the this race between page mkwrite and fallocate in fstests for inserting da entries to extent status cache. Because the race you identified looks like a legitimate race and is mostly happening since ext4_da_map_blocks() was not doing the right thing. ... looking at the src/holetest, it doesn't really excercise this path. So maybe we can writing such fstest to trigger this race. >>> I have reworded some of the commit message, feel free to use it if you >>> think this version is better. The use of which path uses which locks was >>> a bit confusing in the original commit message. >>> > > Thanks for the message improvement, it looks more clear then mine, I will > use it. > Glad, it was helpful. -ritesh