On Thu, Sep 29, 2022 at 12:11:28PM +1000, Dave Chinner wrote:
> On Tue, Sep 27, 2022 at 10:16:42PM -0700, Darrick J. Wong wrote:
> > On Fri, Sep 23, 2022 at 08:59:34AM +1000, Dave Chinner wrote:
> > > On Wed, Sep 21, 2022 at 09:25:26PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Sep 21, 2022 at 06:29:57PM +1000, Dave Chinner wrote:
> > > > > Hi folks,
> > > > >
> > > > > These patches address the data corruption first described here:
> > > > >
> > > > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/
> > > > >
> > > > > This data corruption has been seen in high profile production systems so there is some urgency to fix it. The underlying flaw is essentially a zero-day iomap bug, so whatever fix we come up with needs to be backportable to all supported stable kernels (i.e. ~4.18 onwards).
> > > > >
> > > > > A combination of concurrent write()s, writeback IO completion, and memory reclaim combines to expose the fact that the cached iomap that is held across an iomap_begin/iomap_end iteration can become stale without the iomap iterator actor being aware that the underlying filesystem extent map has changed.
> > > > >
> > > > > Hence actions based on the iomap state (e.g. is unwritten or newly allocated) may actually be incorrect as writeback actions may have changed the state (unwritten to written, delalloc to unwritten or written, etc). This affects partial block/page operations, where we may need to read from disk or zero cached pages depending on the actual extent state. Memory reclaim plays its part here in that it removes pages containing partial state from the page cache, exposing future partial page/block operations to incorrect behaviour.
> > > > >
> > > > > Really, we should have known that this would be a problem - we have exactly the same issue with cached iomaps for writeback, and the ->map_blocks callback that occurs for every filesystem block we need to write back is responsible for validating that the cached iomap is still valid. The data corruption on the write() side is a result of not validating that the iomap is still valid before we initialise new pages and prepare them for data to be copied into them....
> > > > >
> > > > > I'm not really happy with the solution I have for triggering remapping of an iomap when the current one is considered stale. Doing the right thing requires both iomap_iter() to handle stale iomaps correctly (esp. the "map is invalid before the first actor operation" case), and it requires the filesystem iomap_begin/iomap_end operations to co-operate and be aware of stale iomaps.
> > > > >
> > > > > There are a bunch of *nasty* issues around handling failed writes in XFS that this has exposed - a failed write() that races with a mmap() based write to the same delalloc page will result in the mmap writes being silently lost if we punch out the delalloc range we allocated but didn't write to. g/344 and g/346 expose this bug directly if we punch out delalloc regions allocated by now stale mappings.
> > > >
> > > > Yuck.  I'm pretty sure that the caller (xfs_buffered_write_iomap_end) is supposed to call truncate_pagecache_range with the invalidate_lock (fka MMAPLOCK) held.
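
For illustration, a minimal sketch of that ordering, assuming a hypothetical punch_delalloc_extent() stand-in for the filesystem's own punch routine (in XFS the real helper is xfs_bmap_punch_delalloc_range(), which works on the xfs_inode and filesystem block units): the page cache truncation and the delalloc punch happen only while the mapping's invalidate_lock excludes concurrent page faults over the range.

/*
 * Sketch only, not the code from this series: drop the delalloc extent
 * backing a short/failed buffered write while holding
 * mapping->invalidate_lock (fka MMAPLOCK) so that page faults cannot
 * re-instantiate pages over the range being punched.
 */
static void punch_unused_delalloc(struct inode *inode, loff_t start,
				  loff_t end)
{
	filemap_invalidate_lock(inode->i_mapping);
	truncate_pagecache_range(inode, start, end);
	punch_delalloc_extent(inode, start, end);	/* hypothetical helper */
	filemap_invalidate_unlock(inode->i_mapping);
}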
> > >
> > > Yup, there are multiple problems with this code; apart from recognising that it is obviously broken and definitely problematic, I haven't dug into it further.
> >
> > ...and I've been so buried in attending meetings and livedebug sessions related to a 4.14 corruption that now I'm starved of time to fully think through all the implications of this one. :(
> >
> > > > > Then, because we can't punch out the delalloc region we allocated safely when we have a stale iomap, we have to ensure that when we remap it the IOMAP_F_NEW flag is preserved, so that the iomap code knows that it is uninitialised space being written into and so will zero sub page/sub block ranges correctly.
> > > >
> > > > Hm.  IOMAP_F_NEW results in zeroing around, right?  So if the first ->iomap_begin got a delalloc mapping, but by the time we got the folio locked someone else managed to writeback and evict the page, we'd no longer want that zeroing ... right?
> > >
> > > Yes, and that is one of the sources of the data corruption - zeroing when we shouldn't.
> > >
> > > There are multiple vectors to having a stale iomap here:
> > >
> > > 1. we allocate the delalloc range, giving us IOMAP_DELALLOC and IOMAP_F_NEW. Writeback runs, allocating the range as unwritten. Even though the iomap is now stale, there is no data corruption in this case because the range is unwritten and so we still need zeroing.
> >
> > ...and I guess this at least happens more often now that writeback does delalloc -> unwritten -> write -> unwritten conversion?
>
> *nod*
>
> > > 2. Same as above, but IO completion converts the range to written. Data corruption occurs in this case because IOMAP_F_NEW causes incorrect page cache zeroing to occur on partial page writes.
> > >
> > > 3. We have an unwritten extent (prealloc, writeback in progress, etc) so we have IOMAP_UNWRITTEN. These require zeroing, regardless of whether IOMAP_F_NEW is set or not. Extent is written behind our backs, unwritten conversion occurs, and now we zero partial pages when we shouldn't.
> >
> > Yikes.
> >
> > > Other issues I've found:
> > >
> > > 4. page faults can run the buffered write path concurrently with write() because they aren't serialised against each other. Hence we can have overlapping concurrent iomap_iter() operations with different zeroing requirements and it's anyone's guess as to which will win the race to the page lock and do the initial zeroing. This is a potential silent mmap() write data loss vector.
> >
> > TBH I've long wondered why IOLOCK and MMAPLOCK both seemingly protected pagecache operations but the buffered io paths never seemed to take the MMAPLOCK, and if there was some subtle way things could go wrong.
>
> We can't take the MMAPLOCK in the buffered IO path because the user buffer could be a mmap()d range of the same file and we need to be able to fault in those pages during copyin/copyout. Hence we can't hold the MMAPLOCK across iomap_iter(), nor across .iomap_begin/.iomap_end context pairs.

Ahh, right, I forgot that case. >:O

> Taking the MMAPLOCK and dropping it again can be done in iomap_begin or iomap_end, as long as those methods aren't called from the page fault path....
>
> > > 5. anything that can modify the extent layout without holding the i_rwsem exclusive can race with iomap iterating the extent list. Holding the i_rwsem shared and modifying the extent list (e.g. direct IO writes) can result in iomaps changing in the middle of, say, buffered reads (e.g. hole->unwritten->written).
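
The common thread in the cases above is that nothing revalidates the cached iomap once the extent map is allowed to change underneath it. As a purely illustrative sketch of that idea, and not the mechanism this series implements, assume the filesystem bumps a per-inode sequence counter on every extent map change; the value sampled when the iomap was built can then be rechecked before the page cache is zeroed or written into, with a mismatch forcing a remap. read_extent_seq() below is a hypothetical accessor.

/*
 * Illustrative only: detect a stale cached iomap with a per-inode extent
 * map sequence counter. A real implementation also has to decide what to
 * do on mismatch (remap the range and retry the iteration).
 */
struct cached_iomap {
	struct iomap	iomap;	/* as returned by ->iomap_begin */
	u32		seq;	/* extent map sequence when it was built */
};

static bool cached_iomap_still_valid(struct inode *inode,
				     const struct cached_iomap *cmap)
{
	return cmap->seq == read_extent_seq(inode);	/* hypothetical */
}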
> >
> > Yep.  I wonder, can this result in other incorrect write behavior that you and I haven't thought of yet?
>
> Entirely possible - this code is complex and there are lots of very subtle interactions and we've already found several bonus broken bits as a result. Hence I wouldn't be surprised if we've missed other subtle issues and/or not fully grokked the implications of the broken bits we've found...
>
> [....]
>
> > > > What happens if iomap_writepage_map errors out (say because ->map_blocks returns an error) without adding the folio to any ioend?
> > >
> > > Without reading further:
> > >
> > > 1. if we want to retry the write, we folio_redirty_for_writepage(), unlock it and return with no error. Essentially we just skip over it.
> >
> > If the fs isn't shut down, I guess we could redirty the page, though I guess the problem is that the page is now stuck in dirty state until xfs_scrub fixes the problem. If it fixes the problem.
> >
> > I think for bufferhead users it's nastier because we might have a situation where PageDirty is unset but BH_Dirty is still set. It certainly is a problem on 4.14.
> >
> > > 2. If we want to fail the write, we should call mapping_set_error() to record the failure for the next syscall to report and, maybe, set the error flag/clear the uptodate flag on the folio depending on whether we want the data to remain valid in memory or not.
> >
> > <nod> That seems to be happening. Sort of.
> >
> > I think there's also a UAF in iomap_writepage_map -- if the folio is unlocked and we cleared (or never set) PageWriteback, isn't it possible that by the time we get to the mapping_set_error, the folio could have been torn out of the page cache and reused somewhere else?
>
> We still have a reference to the folio at this point from the lookup in write_cache_pages(). Hence the folio can't be freed while we are running iomap_writepage_map().
>
> However, we have unlocked the folio, and we don't hold either the IO lock or the invalidate lock, and so the folio could get punched out of the page cache....
>
> > In which case, we're at best walking off a NULL mapping and crashing the system, and at worst setting an IO error on the wrong mapping?
>
> Yes, I think so - we could walk off a NULL mapping here, but because write_cache_pages() still holds a page reference, the page won't get freed from under us so we won't ever see the wrong mapping being set here.
>
> I think we could fix that simply by using inode->i_mapping instead of folio->mapping...

Oh.  Yes.  I'll get on that tomorrow.

> > > > I think in that case we'll follow the (error && !count) case, in which we unlock the folio and exit without calling folio_redirty_for_writepage, right? The error will get recorded in the mapping for the next fsync, I think, but I also wonder if we *should* redirty because the mapping failed, not the attempt at persistence.
> > >
> > > *nod*
> > >
> > > I think the question that needs to be answered here is this: in what case is an error being returned from ->map_blocks a recoverable error such that a redirty + future writeback retry will succeed?
> > >
> > > AFAICT, in all cases from XFS this is a fatal error (e.g. corruption of the BMBT), so the failure will persist across all attempts to retry the write?
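
For illustration, a hedged sketch of that failure leg with both points above folded in: treat a ->map_blocks failure as fatal for this writeback attempt (no redirty), and record it against inode->i_mapping rather than folio->mapping, because the folio may be punched out of the page cache once it has been unlocked without writeback set. This is a sketch of the direction discussed, not the committed code.

	/*
	 * Sketch of the (error && !count) leg in iomap_writepage_map():
	 * nothing was added to an ioend, so unlock the folio, do not
	 * redirty it, and record the error on the inode's mapping, which
	 * stays valid even if the folio is subsequently invalidated.
	 */
	if (error && !count) {
		folio_unlock(folio);
		mapping_set_error(inode->i_mapping, error);
		/* deliberately no folio_redirty_for_writepage() here */
	}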
> > >
> > > Perhaps online repair will change this (i.e. in the background repair fixes the BMBT corruption and so the next attempt to write the data will succeed), so I can see that we *might* need to redirty the page in this case, but....
> >
> > ...but I don't know that we can practically wait for repairs to happen because the page is now stuck in dirty state indefinitely.
>
> *nod*
>
> So do we treat it as fatal for now, and revisit it later when online repair might be able to do something better here?

Sounds good to me.

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx