On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > can be instantiated over the range whilst we are running
> > xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> > are guaranteed that there will be no IO in progress or can be
> > started over the range we are removing.
> >
> > Really, the issue is that writeback mappings have to be able to
> > handle the range being mapped suddenly appear to be beyond EOF.
> > This behaviour is a longstanding writeback constraint, and is what
> > iomap_writepage_handle_eof() is attempting to handle.
> >
> > We handle this by only sampling i_size_read() whilst we have the
> > folio locked and can determine the action we should take with that
> > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > we've made the decision that the folio is within EOF and taken
> > action on it (i.e. moved the folio to writeback state), we cannot
> > then resample the inode size because a truncate may have started
> > and changed the inode size.
> >
> > We have to complete the mapping of the folio to disk blocks - the
> > disk block mapping is guaranteed to be valid for the life of the IO
> > because the folio is locked and under writeback - and submit the IO
> > so that truncate_pagecache() will unblock and invalidate the folio
> > when the IO completes.
> >
> > Hence writeback vs truncate serialisation is really dependent on
> > only sampling the inode size -once- whilst the dirty folio we are
> > writing back is locked.
> >
>
> Not sure I see how this is a serialization dependency given that
> writeback completion also samples i_size.

Ah, I didn't explain what I meant very clearly, did I?

What I meant was that we can't sample i_size in the IO path without
specific checking/serialisation against truncate operations. And
that means once we have partially zeroed the contents of an
EOF-straddling folio, we can't then sample the EOF again to
determine the length of valid data in the folio, as this can race
with truncate and result in a different size for the data in the
folio than we prepared it for.

> But no matter, it seems a
> reasonable implementation to me to make the submission path consistent
> in handling eof.

Yes, the IO completion path does sample it again via xfs_new_eof().
However, as per above, it has specific checking for truncate down
races and handles them:

/*
 * If this I/O goes past the on-disk inode size update it unless it would
 * be past the current in-core inode size.
 */
static inline xfs_fsize_t
xfs_new_eof(struct xfs_inode *ip, xfs_fsize_t new_size)
{
	xfs_fsize_t i_size = i_size_read(VFS_I(ip));

>>>>	if (new_size > i_size || new_size < 0)
>>>>		new_size = i_size;
	return new_size > ip->i_disk_size ? new_size : 0;
}

If we have a truncate_setsize() called for a truncate down whilst
this IO is in progress, then xfs_new_eof() will see the new, smaller
inode size. The clamp on new_size handles this situation, and we
then only trigger an update if the on-disk size is still smaller
than the new truncated size (i.e. the IO being completed is still
partially within the new EOF from the truncate down).

So I don't think there's an issue here at all at IO completion; it
handles truncate down races cleanly...

> I wonder if this could just use end_pos returned from
> iomap_writepage_handle_eof()?

Yeah, that was what I was thinking, but I haven't looked at the code
for long enough to have any real idea of whether that is sufficient
or not.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
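
For anyone who wants to poke at the clamp in xfs_new_eof() outside the
kernel, here is a rough userspace model of it. This is only a sketch:
struct model_inode and model_new_eof() are made-up names for
illustration, plain int64_t stands in for xfs_fsize_t, and the two
fields stand in for what i_size_read() and ip->i_disk_size would
return. It models the arithmetic only, not any of the locking.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the two sizes xfs_new_eof() compares.
 * Not a kernel structure.
 */
struct model_inode {
	int64_t i_size;		/* in-core size, as i_size_read() would return */
	int64_t i_disk_size;	/* on-disk size, as ip->i_disk_size */
};

/*
 * Same logic as the xfs_new_eof() quoted above: clamp the end offset of
 * the completed IO to the current in-core size, then report a new
 * on-disk size only if that clamped value extends the on-disk size.
 * Returns 0 when no on-disk size update is needed.
 */
static int64_t
model_new_eof(const struct model_inode *ip, int64_t new_size)
{
	int64_t i_size = ip->i_size;

	/* A truncate down may have shrunk i_size under the IO. */
	if (new_size > i_size || new_size < 0)
		new_size = i_size;
	return new_size > ip->i_disk_size ? new_size : 0;
}
```

With an on-disk size of 4096 and an IO ending at 8192, the model
returns 8192 when the in-core size is still 8192, and 6000 if the
in-core size has been truncated down to 6000 in the meantime. If the
clamped value does not exceed the on-disk size, it returns 0 and no
update is made.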