On Fri, Sep 23, 2022 at 09:16:55AM +1000, Dave Chinner wrote: > On Wed, Sep 21, 2022 at 08:40:29PM -0700, Darrick J. Wong wrote: > > On Wed, Sep 21, 2022 at 06:29:58PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > > > A recent multithreaded write data corruption has been uncovered in > > > the iomap write code. The core of the problem is partial folio > > > writes can be flushed to disk while a new racing write can map it > > > and fill the rest of the page: > > > > > > writeback new write > > > > > > allocate blocks > > > blocks are unwritten > > > submit IO > > > ..... > > > map blocks > > > iomap indicates UNWRITTEN range > > > loop { > > > lock folio > > > copyin data > > > ..... > > > IO completes > > > runs unwritten extent conv > > > blocks are marked written > > > <iomap now stale> > > > get next folio > > > } > > > > > > Now add memory pressure such that memory reclaim evicts the > > > partially written folio that has already been written to disk. > > > > > > When the new write finally gets to the last partial page of the new > > > write, it does not find it in cache, so it instantiates a new page, > > > sees the iomap is unwritten, and zeros the part of the page that > > > it does not have data from. This overwrites the data on disk that > > > was originally written. > > > > > > The full description of the corruption mechanism can be found here: > > > > > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/ > > > > > > To solve this problem, we need to check whether the iomap is still > > > valid after we lock each folio during the write. We have to do it > > > after we lock the page so that we don't end up with state changes > > > occurring while we wait for the folio to be locked. > > > > > > Hence we need a mechanism to be able to check that the cached iomap > > > is still valid (similar to what we already do in buffered > > > writeback), and we need a way for ->begin_write to back out and > > > tell the high level iomap iterator that we need to remap the > > > remaining write range. > > > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > > --- > > > fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++--------- > > > fs/iomap/iter.c | 33 ++++++++++++++++++++++++-- > > > include/linux/iomap.h | 17 ++++++++++++++ > > > 3 files changed, 90 insertions(+), 13 deletions(-) > > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > > index ca5c62901541..44c806d46be4 100644 > > > --- a/fs/iomap/buffered-io.c > > > +++ b/fs/iomap/buffered-io.c > > > @@ -584,8 +584,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter, > > > return iomap_read_inline_data(iter, folio); > > > } > > > > > > -static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos, > > > - size_t len, struct folio **foliop) > > > +static int iomap_write_begin(struct iomap_iter *iter, > > > + const struct iomap_ops *ops, loff_t pos, size_t len, > > > + struct folio **foliop) > > > { > > > const struct iomap_page_ops *page_ops = iter->iomap.page_ops; > > > const struct iomap *srcmap = iomap_iter_srcmap(iter); > > > @@ -618,6 +619,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos, > > > status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM; > > > goto out_no_page; > > > } > > > + > > > + /* > > > + * Now we have a locked folio, before we do anything with it we need to > > > + * check that the iomap we have cached is not stale. The inode extent > > > + * mapping can change due to concurrent IO in flight (e.g. > > > + * IOMAP_UNWRITTEN state can change and memory reclaim could have > > > > Earlier, we were talking this specific problem involves zeroing folio > > areas around a folio-unaligned write to a !uptodate folio backed by an > > unwritten extent. Is the scope of this corruption still limited to > > *just* that case? Or does it affect any write to a new mapping or > > beyond EOF? Can it affect a write to an extent that somehow gets > > switched to unwritten or a hole after the mapping has been sampled? > > We only hold the IOLOCK while we are iterating the cached mapping. > Hence anything that can modify the extent tree without holding the > IOLOCK in a way that would exclude iomap iteration can trigger stale > mappings. > > AFAICT, the ways this can happen right now are: > > 1. page fault path takes mapping->invalidate_lock, not IOLOCK > 2. writeback allocating delalloc regions as unwritten > 3. IO completion converting unwritten regions to written. > 4. DIO writes doing extent allocation under IOLOCK_SHARED while > other iomap iterators are holding IOLOCK_SHARED, too. (i.e. hole > -> unwritten). <nod> Ok, I had kinda suspected that this applied to any mapping, since a lot can go on with IOLOCK_SHARED held. > Note that this affects my recent suggestion about using > IOLOCK_SHARED for buffered writes, too - that would be case #5, > similar to #4. <nod> > Stuff that takes the IOLOCK_EXCL, such as truncate, fallocate, etc > for direct extent manipulation cannot trigger stale cached iomaps in > the IO path because they wait for all IO path operations to drain > and for dirty data to flush across the range they are operating on. > > > > + * reclaimed a previously partially written page at this index after IO > > > + * completion before this write reaches this file offset) and hence we > > > + * could do the wrong thing here (zero a page range incorrectly or fail > > > + * to zero) and corrupt data. > > > + */ > > > + if (ops->iomap_valid) { > > > + bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap); > > > > At this point in the (xfs) buffered write cycle, we've taken the IOLOCK > > and folio lock. Is it possible for ->iomap_valid to detect that the > > mapping is no longer valid, trylock the ILOCK, and update the mapping so > > that we don't have to drop the folio lock and go all the way back to > > ->iomap_begin? > > Possibly. But we can fail to get the ILOCK, and that means we still > have to fall back all to being outside the page lock before we can > try again. <nod> > The other issue is that the iomap iter still currently points to the > offset at the start of the write, so we aren't guaranteed that the > refetched iomap even covers the range of the page we currently have > locked. IOWs, we have to advance the iterator before we remap the > range to guarantee that the new iomap covers the range of the file > that we are currently operating on.... Aha! I hadn't caught on to that subtlety. > > If we have to pay the cost of an indirect call, we could try to update > > the mapping if one's available, right? Obviously, if we don't find a > > mapping we're probably going to have to drop the folio lock and allocate > > more space. > > I'd prefer that we get the slow, fall-back-all-the-way path correct > first. We are always going to need this slow path, so let's focus on > making it correct and robust rather than trying to prematurely > optimise it.... Agreed. > > > --- a/fs/iomap/iter.c > > > +++ b/fs/iomap/iter.c > > > @@ -7,12 +7,36 @@ > > > #include <linux/iomap.h> > > > #include "trace.h" > > > > > > +/* > > > + * Advance to the next range we need to map. > > > + * > > > + * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully > > > + * processed - it was aborted because the extent the iomap spanned may have been > > > + * changed during the operation. In this case, the iteration behaviour is to > > > + * remap the unprocessed range of the iter, and that means we may need to remap > > > + * even when we've made no progress (i.e. iter->processed = 0). Hence the > > > + * "finished iterating" case needs to distinguish between > > > + * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we > > > + * need to remap the entire remaining range. > > > + * > > > + * We also need to preserve IOMAP_F_STALE on the iomap so that the next call > > > + * to iomap_begin() knows that it is reprocessing a stale map. Similarly, we > > > + * need to preserve IOMAP_F_NEW as the filesystem may not realise that it > > > + * is remapping a region that it allocated in the previous cycle and we still > > > + * need to initialise partially filled pages within the remapped range. > > > > Must the ->iomap_begin implementation also take care to carry-over the > > IOMAP_F_NEW to the mapping if IOMAP_F_STALE is set? Is it responsible > > for dropping IOMAP_F_STALE, or will iomap_iter* do that? > > iomap_iter() carries STALE and NEW over to the the next > ->iomap_begin() call, which is then responsible for clearing STALE > and propagating NEW if required. Ok. I suspect from the IRC traffic that you've moved on a bit from this approach, so I'll wait for it to appear. --D > > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > > > index 238a03087e17..308931f0840a 100644 > > > --- a/include/linux/iomap.h > > > +++ b/include/linux/iomap.h > > > @@ -62,8 +62,13 @@ struct vm_fault; > > > * > > > * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size > > > * has changed as the result of this write operation. > > > + * > > > + * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file > > > + * range it covers needs to be remapped by the high level before the operation > > > + * can proceed. > > > > I think it might be worth mentioning that /any/ iomap_iter caller can > > set this flag in the loop body and kick things back to ->iomap_begin. > > *nod* > > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx