On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote: > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote: > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote: > > > When performing fsstress test with this patch set, there is a very low probability of > > > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend. > > > After investigation, this was found to be caused by concurrent with truncate operations. > > > Consider a scenario with 4K block size and a file size of 12K. > > > > > > //write back [8K, 12K] //truncate file to 4K > > > ---------------------- ---------------------- > > > iomap_writepage_map xfs_setattr_size > > folio is locked here > > > > iomap_writepage_handle_eof > > > truncate_setsize > > > i_size_write(inode, newsize) //update inode size to 4K > ... > > > > > > It appears that in extreme cases, folios beyond EOF might be written back, > > > resulting in situations where isize is less than pos. In such cases, > > > maybe we should not trim the io_size further. > > > > > > > Hmm.. it might be wise to characterize this further to determine whether > > there are potentially larger problems to address before committing to > > anything. For example, assuming truncate acquires ilock and does > > xfs_itruncate_extents() and whatnot before this ioend submits/completes, > > I don't think xfs_itruncate_extents() is the concern here - that > happens after the page cache and writeback has been sorted out and > the ILOCK has been taken and the page cache state should > have already been sorted out. truncate_setsize() does that for us; > it guarantees that all writeback in the truncate down range has > been completed and the page cache invalidated. > Ah this is what I was missing on previous read through. truncate_inode_pages_range() waits on writeback in its second pass. I was initially confused by what would have prevented removing a writeback folio in the first pass, but that is handled a level down in find_lock_entries() where it skips locked||writeback folios and thus leaves them for the second pass. > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages > can be instantiated over the range whilst we are running > xfs_itruncate_extents(). hence once truncate_setsize() returns, we > are guaranteed that there will be no IO in progress or can be > started over the range we are removing. > > Really, the issue is that writeback mappings have to be able to > handle the range being mapped suddenly appear to be beyond EOF. > This behaviour is a longstanding writeback constraint, and is what > iomap_writepage_handle_eof() is attempting to handle. > > We handle this by only sampling i_size_read() whilst we have the > folio locked and can determine the action we should take with that > folio (i.e. nothing, partial zeroing, or skip altogether). Once > we've made the decision that the folio is within EOF and taken > action on it (i.e. moved the folio to writeback state), we cannot > then resample the inode size because a truncate may have started > and changed the inode size. > > We have to complete the mapping of the folio to disk blocks - the > disk block mapping is guaranteed to be valid for the life of the IO > because the folio is locked and under writeback - and submit the IO > so that truncate_pagecache() will unblock and invalidate the folio > when the IO completes. > > Hence writeback vs truncate serialisation is really dependent on > only sampling the inode size -once- whilst the dirty folio we are > writing back is locked. > Not sure I see how this is a serialization dependency given that writeback completion also samples i_size. But no matter, it seems a reasonable implementation to me to make the submission path consistent in handling eof. I wonder if this could just use end_pos returned from iomap_writepage_handle_eof()? Brian > I suspect that we can store and pass the sampled inode size through > the block mapping and ioend management code so it is constant for > the entire folio IO submission process, but whether we can do that > and still fix the orginal issue that we are trying to fix is not > something I've considered at this point.... > > > does anything in that submission or completion path detect and handle > > this scenario gracefully? What if the ioend happens to be unwritten > > post-eof preallocation and completion wants to convert blocks that might > > no longer exist in the file..? > > That can't happen because writeback must complete before > truncate_setsize() will be allowed to remove the pages from the > cache before xfs_itruncate_extents() can run. > > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >