Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On 2024/5/2 12:11, Ritesh Harjani (IBM) wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>> On Wed, May 01, 2024 at 05:49:50PM +0530, Ritesh Harjani wrote:
>>> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>>>> On Wed, Apr 10, 2024 at 10:29:16PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>> Now we lookup extent status entry without holding the i_data_sem before
>>>>> inserting delalloc block, it works fine in buffered write path and
>>>>> because it holds i_rwsem and folio lock, and the mmap path holds folio
>>>>> lock, so the found extent locklessly couldn't be modified concurrently.
>>>>> But it could be raced by fallocate since it allocate block whitout
>>>>> holding i_rwsem and folio lock.
>>>>> ext4_page_mkwrite()             ext4_fallocate()
>>>>>  block_page_mkwrite()
>>>>>   ext4_da_map_blocks()
>>>>>    //find hole in extent status tree
>>>>>                                  ext4_alloc_file_blocks()
>>>>>                                   ext4_map_blocks()
>>>>>                                    //allocate block and unwritten extent
>>>>>    ext4_insert_delayed_block()
>>>>>     ext4_da_reserve_space()
>>>>>      //reserve one more block
>>>>>     ext4_es_insert_delayed_block()
>>>>>      //drop unwritten extent and add delayed extent by mistake
>>>> Shouldn't this be serialised by the file invalidation lock?  Hole
>>>> punching via fallocate must do this to avoid data use-after-free
>>>> bugs w.r.t racing page faults and all the other fallocate ops need
>>>> to serialise page faults to avoid page cache level data corruption.
>>>> Yet here we see a problem resulting from a fallocate operation
>>>> racing with a page fault....
>>> IIUC, fallocate operations which invalidates the page cache contents needs
>>> to take th invalidate_lock in exclusive mode to prevent page fault
>>> operations from loading pages for stale mappings (blocks which were
>>> marked free might get reused). This can cause stale data exposure.
>>> Here the fallocate operation require allocation of unwritten extents and
>>> does not require truncate of pagecache range. So I guess, it is not
>>> strictly necessary to hold the invalidate lock here.
>> True, but you can make exactly the same argument for write() vs
>> fallocate(). Yet this path in ext4_fallocate() locks out 
>> concurrent write()s and waits for DIOs in flight to drain. What
>> makes buffered writes triggered by page faults special?
>> i.e. if you are going to say "we don't need serialisation between
>> writes and fallocate() allocating unwritten extents", then why is it
>> still explicitly serialising against both buffered and direct IO and
>> not just truncate and other fallocate() operations?
>>> But I see XFS does take IOLOCK_EXCL AND MMAPLOCK_EXCL even for this operation.
>> Yes, that's the behaviour preallocation has had in XFS since we
>> introduced the MMAPLOCK almost a decade ago. This was long before
>> the file_invalidation_lock() was even a glimmer in Jan's eye.
>> btrfs does the same thing, for the same reasons. COW support makes
>> extent tree manipulations excitingly complex at times...
>>> I guess we could use the invalidate lock for fallocate operation in ext4
>>> too. However, I think we still require the current patch. The reason is
>>> ext4_da_map_blocks() call here first tries to lookup the extent status
>>> cache w/o any i_data_sem lock in the fastpath. If it finds a hole, it
>>> takes the i_data_sem in write mode and just inserts an entry into extent
>>> status cache w/o re-checking for the same under the exclusive lock. 
>>> ...So I believe we still should have this patch which re-verify under
>>> the write lock if whether any other operation has inserted any entry
>>> already or not.
>> Yup, I never said the code in the patch is wrong or unnecessary; I'm
>> commenting on the high level race condition that lead to the bug
>> beting triggered. i.e. that racing data modification operations with
>> low level extent manipulations is often dangerous and a potential
>> source of very subtle, hard to trigger, reproduce and debug issues
>> like the one reported...
> Yes, thanks for explaining and commenting on the high level design.
> It was indeed helpful. And I agree with your comment on, we can refactor
> out the common operations from fallocate path and use invalidate lock to
> protect against data modification (page fault) and extent manipulation
> path (fallocate operations).

Yeah, thanks for explanation and suggestion, too. After looking at your
discussion, I also suppose we could refactor a common helper and use the
file invalidation lock for the whole ext4 fallocate path, current code is
too scattered.


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux