Re: xfs, iomap: fix data corrupton due to stale cached iomaps

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Oct 31, 2022 at 08:39:11PM -0700, Darrick J. Wong wrote:
> On Tue, Nov 01, 2022 at 11:34:05AM +1100, Dave Chinner wrote:
> > Recently a customer workload encountered a data corruption in a
> > specific multi-threaded write operation. The workload combined
> > racing unaligned adjacent buffered writes with low memory conditions
> > that caused both writeback and memory reclaim to race with the
> > writes.
> > 
> > The result of this was random partial blocks containing zeroes
> > instead of the correct data.  The underlying problem is that iomap
> > caches the write iomap for the duration of the write() operation,
> > but it fails to take into account that the extent underlying the
> > iomap can change whilst the write is in progress.
> 
> Wheeeee....
> 
> > The short story is that an iomap can span mutliple folios, and so
> > under low memory writeback can be cleaning folios the write()
> > overlaps. Whilst the overlapping data is cached in memory, this
> > isn't a problem, but because the folios are now clean they can be
> > reclaimed. Once reclaimed, the write() does the wrong thing when
> > re-instantiating partial folios because the iomap no longer reflects
> > the underlying state of the extent. e.g. it thinks the extent is
> > unwritten, so it zeroes the partial range, when in fact the
> > underlying extent is now written and so it should have read the data
> 
> DOH!!
> 
> > from disk.  This is how we get random zero ranges in the file
> > instead of the correct data.
> > 
> > The gory details of the race condition can be found here:
> > 
> > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/
> > 
> > Fixing the problem has two aspects. The first aspect of the problem
> > is ensuring that iomap can detect a stale cached iomap during a
> > write in a race-free manner. We already do this stale iomap
> > detection in the writeback path, so we have a mechanism for
> > detecting that the iomap backing the data range may have changed
> > and needs to be remapped.
> > 
> > In the case of the write() path, we have to ensure that the iomap is
> > validated at a point in time when the page cache is stable and
> > cannot be reclaimed from under us. We also need to validate the
> > extent before we start performing any modifications to the folio
> > state or contents. Combine these two requirements together, and the
> > only "safe" place to validate the iomap is after we have looked up
> > and locked the folio we are going to copy the data into, but before
> > we've performed any initialisation operations on that folio.
> > 
> > If the iomap fails validation, we then mark it stale, unlock the
> > folio and end the write. This effectively means a stale iomap
> > results in a short write.
> 
> Does this short write echo all the way out to userspace?  Or do we
> merely go 'round iomap_iter() again?

It triggers another iomap_iter() loop because we haven't consumed
the data in the iov_iter yet. It essentially advances the iter to
the end of the short write and runs another begin/write iter/end
loop to continue the write with a new iomap. The stale flag is
needed so that a "no bytes written so return to the caller" can be
distinguished from "no bytes written because the iomap was
immediately found to be stale so we need to loop again to remap it".

> > write() in progress doesn't necessarily own the data in the page
> > cache over the range of the delalloc extent it just allocated.
> > 
> > As a result, we can't just truncate the page cache over the range
> > the write() didn't reach and punch all the delalloc extent. We have
> > to walk the page cache over the untouched range and skip over any
> > dirty data region in the cache in that range. Which is ....
> > non-trivial.
> > 
> > That is, iterating the page cache has to handle partially populated
> > folios (i.e. block size < page size) that contain data. The data
> > might be discontiguous within a folio. Indeed, there might be
> > *multiple* discontiguous data regions within a single folio. And to
> > make matters more complex, multi-page folios mean we just don't know
> > how many sub-folio regions we might have to iterate to find all
> > these regions. All the corner cases between the conversions and
> > rounding between filesystem block size, folio size and multi-page
> > folio size combined with unaligned write offsets kept breaking my
> > brain.
> > 
> > Eventually, I realised that if the XFS code tracked the processed
> > write regions by byte ranges instead of fileysetm block or page
> > cache index, we could simply use mapping_seek_hole_data() to find
> > the start and end of each discrete data region within the range we
> > needed to scan. SEEK_DATA finds the start of the cached data region,
> > SEEK_HOLE finds the end of the region. THese are byte based
> > interfaces that understand partially uptodate folio regions, and so
> > can iterate discrete sub-folio data regions directly. This largely
> > solved the problem of discovering the dirty regions we need to keep
> > the delalloc extent over.
> 
> Heh.  Clever!

It'd be clever if I didn't take a couple of weeks banging my head
against continual off-by-one and rounding bugs before I realised
there was a better way... :(

> > Of course, now xfs/196 fails. This is a error injection test that is
> > supposed to exercise the delalloc extent recover code that the above
> > fixes just completely reworked. the error injection assumes that it
> > can just truncate the page cache over the write and then punch out
> > the delalloc extent completely. This is fundamentally broken, and
> > only has been working by chance - the chance is that writes are page
> > aligned and page aligned writes don't install large folios in the
> > page cache.
> 
> Side question -- should we introduce a couple of debug knobs to slow
> down page cache write and writeback so that we can make it easier to
> (cough) experiment with races?

Perhaps so. There was once the equivalent in Irix (dating back as
far as 1996, IIRC) for this sort of testing.  Search for
XFSRACEDEBUG in the historic archive, there are:

#ifdef XFSRACEDEBUG
	delay_for_intr();
	delay(100);
#endif

clauses added and removed regularly in the data and metadata IO
paths through the early history of the code base....

> Now that we've been bitten hard by the writeback race, I wrote a
> testcase to exercise it.  The problem was solved long ago so the only
> way you can tell its working is to inject debugging printks and look for
> them, but I feel like we ought to have test points for these two serious
> corruption problems.

Well, I just assume writeback races with everything and mostly
everything is OK. What I didn't expect was racing writeback enabling
racing reclaim to free pages dirtied by a write() before the write() has
completed (i.e. while it is still dirtying pages and copying data
into the page cache). i.e. it was the racing page reclaim that exposed
the data corruption bugs, not writeback.

As it is, I suspect that we should actually try to run fstests in
massively constrained memcgs more often - all sorts of problems have
fallen out of doing this (e.g. finding GFP_KERNEL allocations
in explicit, known GFP_NOFS contexts)....

> i IOWs, with sub-folio block size, and not know what size folios are
> > in the cache, we can't actually guarantee that we can remove the
> 
> Is that supposed to read "...and not knowing what size folios..."?

*nod*

> > cached dirty folios from the cache via truncation, and hence the new
> > code will not remove the delalloc extents under those dirty folios.
> > As a result the error injection results is writing zeroes to disk
> > rather that removing the delalloc extents from memory. I can't make
> 
> rather *than*?

*nod*

> > this error injection to work the way it was intended, so I removed
> > it. The code that it is supposed to exercise is now exercised every time we
> > detect a stale iomap, so we have much better coverage of the failed
> > write error handling than the error injection provides us with,
> > anyway....
> > 
> > So, this passes fstests on 1kb and 4kb block sizes and the data
> > corruption reproducer does not detect data corruption, so this set
> > of fixes is /finally/ something I'd consider ready for merge.
> > Comments and testing welcome!
> 
> Urrrk.  Thank you for your work on this. :)
> 
> /me functionally braindead on halloween candy, will read the code
> tomorrow.  Well ok we'll see how long it takes curiosity to get the
> better of me...

Thanks!

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux