Am Mo., 9. Jan. 2023 um 23:58 Uhr schrieb Dave Chinner <david@xxxxxxxxxxxxx>: > On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote: > > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote: > > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio() > > > > handler and validating the mapping there. > > > > > > > > Signed-off-by: Andreas Gruenbacher <agruenba@xxxxxxxxxx> > > > > > > I think this is wrong. > > > > > > The ->iomap_valid() function handles a fundamental architectural > > > issue with cached iomaps: the iomap can become stale at any time > > > whilst it is in use by the iomap core code. > > > > > > The current problem it solves in the iomap_write_begin() path has to > > > do with writeback and memory reclaim races over unwritten extents, > > > but the general case is that we must be able to check the iomap > > > at any point in time to assess it's validity. > > > > > > Indeed, we also have this same "iomap valid check" functionality in the > > > writeback code as cached iomaps can become stale due to racing > > > writeback, truncated, etc. But you wouldn't know it by looking at the iomap > > > writeback code - this is currently hidden by XFS by embedding > > > the checks into the iomap writeback ->map_blocks function. > > > > > > That is, the first thing that xfs_map_blocks() does is check if the > > > cached iomap is valid, and if it is valid it returns immediately and > > > the iomap writeback code uses it without question. > > > > > > The reason that this is embedded like this is that the iomap did not > > > have a validity cookie field in it, and so the validity information > > > was wrapped around the outside of the iomap_writepage_ctx and the > > > filesystem has to decode it from that private wrapping structure. > > > > > > However, the validity information iin the structure wrapper is > > > indentical to the iomap validity cookie, > > > > Then could that part of the xfs code be converted to use > > iomap->validity_cookie so that struct iomap_writepage_ctx can be > > eliminated? > > Yes, that is the plan. > > > > > > and so the direction I've > > > been working towards is to replace this implicit, hidden cached > > > iomap validity check with an explicit ->iomap_valid call and then > > > only call ->map_blocks if the validity check fails (or is not > > > implemented). > > > > > > I want to use the same code for all the iomap validity checks in all > > > the iomap core code - this is an iomap issue, the conditions where > > > we need to check for iomap validity are different for depending on > > > the iomap context being run, and the checks are not necessarily > > > dependent on first having locked a folio. > > > > > > Yes, the validity cookie needs to be decoded by the filesystem, but > > > that does not dictate where the validity checking needs to be done > > > by the iomap core. > > > > > > Hence I think removing ->iomap_valid is a big step backwards for the > > > iomap core code - the iomap core needs to be able to formally verify > > > the iomap is valid at any point in time, not just at the point in > > > time a folio in the page cache has been locked... > > > > We don't need to validate an iomap "at any time". It's two specific > > places in the code in which we need to check, and we're not going to > > end up with ten more such places tomorrow. > > Not immediately, but that doesn't change the fact this is not a > filesystem specific issue - it's an inherent characteristic of > cached iomaps and unsynchronised extent state changes that occur > outside exclusive inode->i_rwsem IO context (e.g. in writeback and > IO completion contexts). > > Racing mmap + buffered writes can expose these state changes as the > iomap bufferred write IO path is not serialised against the iomap > mmap IO path except via folio locks. Hence a mmap page fault can > invalidate a cached buffered write iomap by causing a hole -> > unwritten, hole -> delalloc or hole -> written conversion in the > middle of the buffered write range. The buffered write still has a > hole mapping cached for that entire range, and it is now incorrect. > > If the mmap write happens to change extent state at the trailing > edge of a partial buffered write, data corruption will occur if we > race just right with writeback and memory reclaim. I'm pretty sure > that this corruption can be reporduced on gfs2 if we try hard enough > - generic/346 triggers the mmap/write race condition, all that is > needed from that point is for writeback and reclaiming pages at > exactly the right time... > > > I'd prefer to keep those > > filesystem internals in the filesystem specific code instead of > > exposing them to the iomap layer. But that's just me ... > > My point is that there is nothing XFS specific about these stale > cached iomap race conditions, nor is it specifically related to > folio locking. The folio locking inversions w.r.t. iomap caching and > the interactions with writeback and reclaim are simply the > manifestation that brought the issue to our attention. > > This is why I think hiding iomap validation filesystem specific page > cache allocation/lookup functions is entirely the wrong layer to be > doing iomap validity checks. Especially as it prevents us from > adding more validity checks in the core infrastructure when we need > them in future. > > AFAIC, an iomap must carry with it a method for checking > that it is still valid. We need it in the write path, we need it in > the writeback path. If we want to relax the restrictions on clone > operations (e.g. shared locking on the source file), we'll need to > be able to detect stale cached iomaps in those paths, too. And I > haven't really thought through all the implications of shared > locking on buffered writes yet, but that may well require more > checks in other places as well. > > > If we ignore this particular commit for now, do you have any > > objections to the patches in this series? If not, it would be great if > > we could add the other patches to iomap-for-next. > > I still don't like moving page cache operations into individual > filesystems, but for the moment I can live with the IOMAP_NOCREATE > hack to drill iomap state through the filesystem without the > filesystem being aware of it. Alright, works for me. Darrick? > > By the way, I'm still not sure if gfs2 is affected by this whole iomap > > validation drama given that it neither implements unwritten extents > > nor delayed allocation. This is a mess. > > See above - I'm pretty sure it will be, but it may be very difficult > to expose. After all, it's taken several years before anyone noticed > this issue with XFS, even though we were aware of the issue of stale > cached iomaps causing data corruption in the writeback path.... Okay, that's all pretty ugly. Thanks a lot for the detailed explanation. Cheers, Andreas > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx