On Wed, Dec 04, 2024 at 08:12:05AM +1100, Dave Chinner wrote:
> On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> > On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > > can be instantiated over the range whilst we are running
> > > xfs_itruncate_extents(). Hence once truncate_setsize() returns, we
> > > are guaranteed that there will be no IO in progress, nor any able
> > > to be started, over the range we are removing.
> > >
> > > Really, the issue is that writeback mappings have to be able to
> > > handle the range being mapped suddenly appearing to be beyond EOF.
> > > This behaviour is a longstanding writeback constraint, and is what
> > > iomap_writepage_handle_eof() is attempting to handle.
> > >
> > > We handle this by only sampling i_size_read() whilst we have the
> > > folio locked and can determine the action we should take with that
> > > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > > we've made the decision that the folio is within EOF and taken
> > > action on it (i.e. moved the folio to writeback state), we cannot
> > > then resample the inode size because a truncate may have started
> > > and changed the inode size.
> > >
> > > We have to complete the mapping of the folio to disk blocks - the
> > > disk block mapping is guaranteed to be valid for the life of the IO
> > > because the folio is locked and under writeback - and submit the IO
> > > so that truncate_pagecache() will unblock and invalidate the folio
> > > when the IO completes.
> > >
> > > Hence writeback vs truncate serialisation is really dependent on
> > > only sampling the inode size -once- whilst the dirty folio we are
> > > writing back is locked.
> > >
> >
> > Not sure I see how this is a serialization dependency given that
> > writeback completion also samples i_size.
> 
> Ah, I didn't explain what I meant very clearly, did I?
> 
> What I mean was we can't sample i_size in the IO path without
> specific checking/serialisation against truncate operations. And
> that means once we have partially zeroed the contents of an EOF
> straddling folio, we can't then sample the EOF again to determine
> the length of valid data in the folio, as this can race with truncate
> and result in a different size for the data in the folio than we
> prepared it for.
> 

Ok, I think we're just saying the same thing using different words.

> > But no matter, it seems a
> > reasonable implementation to me to make the submission path consistent
> > in handling eof.
> 
> Yes, the IO completion path does sample it again via xfs_new_eof().
> However, as per above, it has specific checking for truncate down
> races and handles them:
> 
> /*
>  * If this I/O goes past the on-disk inode size update it unless it would
>  * be past the current in-core inode size.
>  */
> static inline xfs_fsize_t
> xfs_new_eof(struct xfs_inode *ip, xfs_fsize_t new_size)
> {
> 	xfs_fsize_t	i_size = i_size_read(VFS_I(ip));
> 
> >>>>	if (new_size > i_size || new_size < 0)
> >>>>		new_size = i_size;
> 	return new_size > ip->i_disk_size ? new_size : 0;
> }
> 
> If we have a truncate_setsize() called for a truncate down whilst
> this IO is in progress, then xfs_new_eof() will see the new, smaller
> inode i_size. The clamp on new_size handles this situation, and we
> then only trigger an update if the on-disk size is still smaller
> than the new truncated size (i.e. the IO being completed is still
> partially within the new EOF from the truncate down).
> 
> So I don't think there's an issue here at all at IO completion;
> it handles truncate down races cleanly...
> 

Agree.. this was kind of the point of the submit side trimming. I'm not
sure a second sample of i_size on submission for trimming purposes
affects this in any problematic way either.
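For the record, here is a quick userspace model of that clamp with
made-up sizes, just to spell out the truncate down case. This is only an
illustrative sketch of the helper quoted above (model_new_eof() and the
sizes are invented for the example), not the kernel code:

#include <stdio.h>

/*
 * Model of the xfs_new_eof() clamp quoted above: new_size is the offset
 * just past the completed IO, i_size the current in-core size (which a
 * racing truncate down may already have shrunk), and disk_size the
 * current on-disk inode size.  Returns the new on-disk size to record,
 * or 0 if no update is needed.
 */
static long long model_new_eof(long long new_size, long long i_size,
			       long long disk_size)
{
	if (new_size > i_size || new_size < 0)
		new_size = i_size;
	return new_size > disk_size ? new_size : 0;
}

int main(void)
{
	/* 1MB IO completes after a racing truncate down to 64k. */
	printf("%lld\n", model_new_eof(1048576, 65536, 4096));	/* 65536 */

	/* Same race, but the on-disk size already matches: no update. */
	printf("%lld\n", model_new_eof(1048576, 65536, 65536));	/* 0 */
	return 0;
}

In other words, completion never pushes the on-disk size past whatever
i_size a racing truncate down has already installed.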
Brian

> > I wonder if this could just use end_pos returned from
> > iomap_writepage_handle_eof()?
> 
> Yeah, that was what I was thinking, but I haven't looked at the code
> for long enough to have any real idea of whether that is sufficient
> or not.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 