Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes:

> On 2024/4/26 20:57, Ritesh Harjani (IBM) wrote:
>> Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes:
>> 
>>> Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes:
>>>
>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>
>>>> Now we lookup extent status entry without holding the i_data_sem before
>>>> inserting delalloc block, it works fine in buffered write path and
>>>> because it holds i_rwsem and folio lock, and the mmap path holds folio
>>>> lock, so the found extent locklessly couldn't be modified concurrently.
>>>> But it could be raced by fallocate since it allocate block whitout
>>>> holding i_rwsem and folio lock.
>>>>
>>>> ext4_page_mkwrite()             ext4_fallocate()
>>>>  block_page_mkwrite()
>>>>   ext4_da_map_blocks()
>>>>    //find hole in extent status tree
>>>>                                  ext4_alloc_file_blocks()
>>>>                                   ext4_map_blocks()
>>>>                                    //allocate block and unwritten extent
>>>>    ext4_insert_delayed_block()
>>>>     ext4_da_reserve_space()
>>>>      //reserve one more block
>>>>     ext4_es_insert_delayed_block()
>>>>      //drop unwritten extent and add delayed extent by mistake
>>>>
>>>> Then, the delalloc extent is wrong until writeback, the one more
>>>> reserved block can't be release any more and trigger below warning:
>>>>
>>>>  EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared!
>>>>
>>>> Hold i_data_sem in write mode directly can fix the problem, but it's
>>>> expansive, we should keep the lockless check and check the extent again
>>>> once we need to add an new delalloc block.
>>>
>>> Hi Zhang, 
>>>
>>> It's a nice finding. I was wondering if this was caught in any of the
>>> xfstests?
>>>
>
> Hi, Ritesh
>
> I caught this issue when I tested my iomap series in generic/344 and
> generic/346. It's easy to reproduce because the iomap's buffered write path
> doesn't hold folio lock while inserting delalloc blocks, so it could be raced
> by the mmap page fault path. But the buffer_head's buffered write path can't
> trigger this problem,

ya right! That's the difference between how ->map_blocks() is called
between buffer_head v/s iomap path. In iomap the ->map_blocks() call
happens first to map a large extent and then it iterate over all the
locked folios covering the mapped extent for doing writes.
Whereas in buffer_head while iterating, we first instantiate/lock the
folio and then call ->map_blocks() to map an extent for the given folio.

... So this opens up this window for a race between iomap buffered write
path v/s page mkwrite path for inserting delalloc blocks entries.

> the race between buffered write path and fallocate path
> was discovered while I was analyzing the code, so I'm not sure if it could
> be caught by xfstests now, at least I haven't noticed this problem so far.
>

Did you mean the race between page fault path and fallocate path here?
Because buffered write path and fallocate path should not have any race
since both takes the inode_lock. I guess you meant page fault path and
fallocate path for which you wrote this patch too :)

I am surprised, why we cannot see the this race between page mkwrite and
fallocate in fstests for inserting da entries to extent status cache.
Because the race you identified looks like a legitimate race and is
mostly happening since ext4_da_map_blocks() was not doing the right
thing.
... looking at the src/holetest, it doesn't really excercise this path.
So maybe we can writing such fstest to trigger this race.


>>> I have reworded some of the commit message, feel free to use it if you
>>> think this version is better. The use of which path uses which locks was
>>> a bit confusing in the original commit message.
>>>
>
> Thanks for the message improvement, it looks more clear then mine, I will
> use it.
>

Glad, it was helpful.

-ritesh




[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux