On 2024/5/2 12:11, Ritesh Harjani (IBM) wrote: > Dave Chinner <david@xxxxxxxxxxxxx> writes: > >> On Wed, May 01, 2024 at 05:49:50PM +0530, Ritesh Harjani wrote: >>> Dave Chinner <david@xxxxxxxxxxxxx> writes: >>> >>>> On Wed, Apr 10, 2024 at 10:29:16PM +0800, Zhang Yi wrote: >>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx> >>>>> >>>>> Now we lookup extent status entry without holding the i_data_sem before >>>>> inserting delalloc block, it works fine in buffered write path and >>>>> because it holds i_rwsem and folio lock, and the mmap path holds folio >>>>> lock, so the found extent locklessly couldn't be modified concurrently. >>>>> But it could be raced by fallocate since it allocate block whitout >>>>> holding i_rwsem and folio lock. >>>>> >>>>> ext4_page_mkwrite() ext4_fallocate() >>>>> block_page_mkwrite() >>>>> ext4_da_map_blocks() >>>>> //find hole in extent status tree >>>>> ext4_alloc_file_blocks() >>>>> ext4_map_blocks() >>>>> //allocate block and unwritten extent >>>>> ext4_insert_delayed_block() >>>>> ext4_da_reserve_space() >>>>> //reserve one more block >>>>> ext4_es_insert_delayed_block() >>>>> //drop unwritten extent and add delayed extent by mistake >>>> >>>> Shouldn't this be serialised by the file invalidation lock? Hole >>>> punching via fallocate must do this to avoid data use-after-free >>>> bugs w.r.t racing page faults and all the other fallocate ops need >>>> to serialise page faults to avoid page cache level data corruption. >>>> Yet here we see a problem resulting from a fallocate operation >>>> racing with a page fault.... >>> >>> IIUC, fallocate operations which invalidates the page cache contents needs >>> to take th invalidate_lock in exclusive mode to prevent page fault >>> operations from loading pages for stale mappings (blocks which were >>> marked free might get reused). This can cause stale data exposure. >>> >>> Here the fallocate operation require allocation of unwritten extents and >>> does not require truncate of pagecache range. So I guess, it is not >>> strictly necessary to hold the invalidate lock here. >> >> True, but you can make exactly the same argument for write() vs >> fallocate(). Yet this path in ext4_fallocate() locks out >> concurrent write()s and waits for DIOs in flight to drain. What >> makes buffered writes triggered by page faults special? >> >> i.e. if you are going to say "we don't need serialisation between >> writes and fallocate() allocating unwritten extents", then why is it >> still explicitly serialising against both buffered and direct IO and >> not just truncate and other fallocate() operations? >> >>> But I see XFS does take IOLOCK_EXCL AND MMAPLOCK_EXCL even for this operation. >> >> Yes, that's the behaviour preallocation has had in XFS since we >> introduced the MMAPLOCK almost a decade ago. This was long before >> the file_invalidation_lock() was even a glimmer in Jan's eye. >> >> btrfs does the same thing, for the same reasons. COW support makes >> extent tree manipulations excitingly complex at times... >> >>> I guess we could use the invalidate lock for fallocate operation in ext4 >>> too. However, I think we still require the current patch. The reason is >>> ext4_da_map_blocks() call here first tries to lookup the extent status >>> cache w/o any i_data_sem lock in the fastpath. If it finds a hole, it >>> takes the i_data_sem in write mode and just inserts an entry into extent >>> status cache w/o re-checking for the same under the exclusive lock. >>> ...So I believe we still should have this patch which re-verify under >>> the write lock if whether any other operation has inserted any entry >>> already or not. >> >> Yup, I never said the code in the patch is wrong or unnecessary; I'm >> commenting on the high level race condition that lead to the bug >> beting triggered. i.e. that racing data modification operations with >> low level extent manipulations is often dangerous and a potential >> source of very subtle, hard to trigger, reproduce and debug issues >> like the one reported... >> > > Yes, thanks for explaining and commenting on the high level design. > It was indeed helpful. And I agree with your comment on, we can refactor > out the common operations from fallocate path and use invalidate lock to > protect against data modification (page fault) and extent manipulation > path (fallocate operations). > Yeah, thanks for explanation and suggestion, too. After looking at your discussion, I also suppose we could refactor a common helper and use the file invalidation lock for the whole ext4 fallocate path, current code is too scattered. Thanks, Yi.