[one more time with feeling, after fixing dm-devel address typo]

On Tue, Apr 23, 2024 at 01:19:39PM -0400, Mike Snitzer wrote:
> [adding snitzer@xxxxxxxxxx and dm-devel to the Cc so I don't lose this
> important thread again]
> [this was old Red Hat internal discussion that I tripped over like a
> dead cat, but there isn't anything sensitive here: dm-thinp needs to
> improve]
>
> On Sun, Jan 24 2021 at  8:34P -0500,
> Dave Chinner <dchinner@xxxxxxxxxx> wrote:
>
> > On Thu, Jan 21, 2021 at 11:25:37PM -0500, Mike Snitzer wrote:
> > > On Thu, Jan 21 2021 at  9:33pm -0500,
> > > Dave Chinner <dchinner@xxxxxxxxxx> wrote:
> > >
> > > > On Thu, Jan 21, 2021 at 04:23:20PM +0000, Joe Thornber wrote:
> > > > > On Thu, 2021-01-21 at 17:45 +1100, Dave Chinner wrote:
> > > > > > On Wed, Jan 20, 2021 at 10:34:01AM +0000, Joe Thornber wrote:
> > > > > >
> > > > > > > - Currently the tools unshare _metadata_ when restoring/repairing
> > > > > > >   pools with snapshots.  Combined with the hard 16gb limit on
> > > > > > >   metadata size this means some systems with many snapshots can't
> > > > > > >   be repaired if they run into trouble.  Next release of the tools
> > > > > > >   will fix, but as said before, we have to assess the software as is.
> > > > > >
> > > > > > Can you explain more about this limit and what it means? The test
> > > > > > I'm currently running (1000 iterations of "write 10,000 4kB IOs,
> > > > > > snapshot"
> > > > >
> > > > > Right let's talk about your scenario.
> > > > >
> > > > > Firstly metadata.  Our transactional model is based on persistent data
> > > > > structures in the functional programming sense.  To update metadata in
> > > > > a new transaction we COW key metadata pages to build a new data
> > > > > structure that shares data with that of the previous transaction.  We
> > > > > amortise the costs associated with this by holding the transactions
> > > > > open as long as possible and allowing real mutation *within* a
> > > > > transaction.
> > > > >
> > > > > There are advantages and disadvantages with this scheme.
> > > > >
> > > > > Advantages: It's simple.  It allows the 'take snapshot' operation to be
> > > > > very quick (just writing a handful of pages).  It also allows us to
> > > > > take a 'metadata snapshot' of all the metadata which userland can
> > > > > access safely.  So we can move the implementation of certain
> > > > > query/health functions entirely to userland (see thin_check, thin_ls,
> > > > > thin_dump etc).
> > > > >
> > > > > Disadvantages: The biggest down side is everything is done in one
> > > > > global transaction; if a commit is required by one thin device (eg, due
> > > > > to a REQ_FUA), then other active thin devices will be temporarily
> > > > > slowed as the transaction commits.
> > > >
> > > > OUCH!
> > > >
> > > > That means every journal IO from the filesystem on the thin vol will
> > > > trigger a globally serialising metadata commit. It also means every
> > > > user O_DSYNC/RWF_DSYNC direct IO (which we optimise to FUA writes
> > > > to avoid journal flushes) will trigger a thinp commit.
> > >
> > > So XFS hijacks REQ_FUA to bypass its journaling _but_ expects that
> > > REQ_FUA to have no impact on the underlying storage?  Why not clear
> > > REQ_FUA after XFS decides to bypass its journal?  Why use REQ_FUA for
> > > this?  What are your assumptions about how the underlying block device
> > > will respond?
> >
> > "Hijack" is a bit ... strong.
> > We replaced a full device cache flush with an FUA write at the request
> > of a partner that demonstrated that using FUA writes in their
> > applications on other operating systems on the same hardware
> > configurations ran them 25-50% faster than on RHEL/XFS.
> >
> > I reproduced their results locally, and the difference in
> > performance was the device level overhead between write+full device
> > cache flush (the traditional linux DIO O_DSYNC implementation) and
> > using FUA writes to avoid the full device cache flush when only the
> > data in the IO itself needed to be written to stable storage.
> >
> > It's not at all controversial - REQ_FUA has been used by filesystems
> > for years for their journal writes to optimise away the post-write
> > cache flush on hardware that supports FUA. And ext4, gfs2, zonefs
> > and btrfs all use this same direct IO code path now, so it's not
> > just XFS doing this...
> >
> > FWIW, I mused on the weekend that we don't even need to check that
> > the block device supports FUA to make this optimisation, because the
> > block layer will convert REQ_FUA to a normal write + post-write
> > cache flush if the underlying device doesn't support FUA. So we are
> > likely to make all pure O_DSYNC data overwrites use REQ_FUA in the
> > future, not just on devices that support FUA...
> >
> > > > This is what enterprise applications do for IO performance - small,
> > > > random, highly concurrent O_DSYNC AIO+DIO overwrites - so it seems
> > > > like this is going to cause IO serialisation problems even if there is
> > > > only one thin-vol in a thin-pool...
> > >
> > > Yes, but I currently cannot say this is thinp's fault. REQ_FUA must be
> > > honored. If dm-thinp is being asked to flush caches and get a
> > > consistent point-in-time it cannot be ignored.
> >
> > Yes, that's the point I was trying to make. FUA is not a cache flush
> > request - it's much more fine grained than that, and so can be used
> > to elide full device cache flushes. Hence requiring global commits
> > to honor FUA data writes defeats the higher level optimisations that
> > try to minimise the device cache flush overhead...
> >
> > > So what additional data structures and approach should we be taking to
> > > mask the impact of needing to keep data and metadata consistent?
> >
> > I'm not a dm-thin expert, but I can describe the constraints for
> > using REQ_FUA because I think they probably apply to dm-thin, too.
> >
> > For filesystems, there are per-inode metadata trees, and that's all
> > we need to be stable w.r.t. O_DSYNC data writes. If the per-inode
> > metadata is up-to-date on disk (i.e. we are doing a pure overwrite
> > and no metadata needs to change to perform the data IO) then we can
> > issue a REQ_FUA write. If there is any metadata that needs to be
> > modified to perform the write (e.g. we need an allocation, convert
> > an unwritten extent, etc) or something else dirtied the inode
> > metadata prior to the write, then we do a normal data write and a
> > post-write journal flush (which is a REQ_PREFLUSH|REQ_FUA write) to
> > ensure both the data and the metadata that references it are on
> > stable storage.
> >
> > See this commit:
> >
> > commit 3460cac1ca76215a60acb086ebe97b3e50731628
> > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > Date:   Wed May 2 12:54:53 2018 -0700
> >
> >     iomap: Use FUA for pure data O_DSYNC DIO writes
> > ....
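
[Aside: to make Dave's pure-overwrite rule concrete, here is a rough,
self-contained sketch.  It is not the kernel's iomap code: the struct,
enum and SKETCH_* names below are made-up stand-ins, and only the
behaviour is taken from the description above.]

/*
 * Illustrative sketch only -- not the kernel's iomap code.  It shows the
 * rule described above: a pure O_DSYNC overwrite of an already-allocated,
 * clean extent can be issued as a REQ_FUA data write, while anything that
 * needs a metadata change (allocation, unwritten conversion, COW, or a
 * previously dirtied inode) falls back to a normal write followed by a
 * REQ_PREFLUSH|REQ_FUA journal flush.
 */

/* hypothetical stand-ins for the real iomap flags and extent state */
#define SKETCH_F_DIRTY     (1u << 0)   /* inode has O_DSYNC-dirty metadata */
#define SKETCH_F_SHARED    (1u << 1)   /* extent is shared, write needs COW */
#define SKETCH_UNWRITTEN   2           /* extent allocated but unwritten */

struct extent_sketch {
	unsigned int flags;
	int type;
};

enum dsync_strategy {
	DSYNC_FUA_WRITE,         /* issue the data write with REQ_FUA */
	DSYNC_WRITE_THEN_FLUSH,  /* normal write + REQ_PREFLUSH|REQ_FUA */
};

static enum dsync_strategy choose_dsync_strategy(const struct extent_sketch *ext)
{
	/* Metadata must change (or is already dirty): data alone isn't enough. */
	if (ext->flags & (SKETCH_F_DIRTY | SKETCH_F_SHARED))
		return DSYNC_WRITE_THEN_FLUSH;
	if (ext->type == SKETCH_UNWRITTEN)
		return DSYNC_WRITE_THEN_FLUSH;

	/*
	 * Pure overwrite with stable metadata: REQ_FUA covers the data, and
	 * the block layer emulates FUA (write + post-write cache flush) on
	 * devices that don't support it natively.
	 */
	return DSYNC_FUA_WRITE;
}

The branch that matters for dm-thin is the last one: the filesystem only
issues a bare REQ_FUA data write when its own metadata is already stable,
so the device below it sees FUA mostly on pure overwrites.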
> >
> > Also, check the current upstream tree for IOMAP_F_DIRTY so the fs
> > can indicate that the inode has O_DSYNC dirty metadata (allocate,
> > unwritten) or IOMAP_F_SHARED (requires COW), and the iomap
> > implementation uses the IOMAP_DIO_WRITE_FUA and IOMAP_DIO_NEED_SYNC
> > flags to determine what to do at IO completion.
> >
> > I suspect that dm-thin would need to do something similar for
> > REQ_FUA writes - if the thin-dev metadata (rather than the global
> > pool metadata) is clean and there's an already allocated block being
> > overwritten, then you can just pass the FUA down and not have
> > to run a global transaction commit. If there's anything dirty, or
> > the IO requires COW or allocation, then it needs to run through the
> > existing slow path...
> >
> > > > > This limits the number of active
> > > > > thin devices. Similarly if different thin devices are performing
> > > > > operations that allocate new data space (provisioning, COW), then there
> > > > > can be contention in the allocators.
> > > >
> > > > Or we are doing concurrent IO to the same thin device...
> > >
> > > Concurrent IO that is causing a denial-of-service due to floods of
> > > REQ_FUA IO? Yeap.. that sounds like it'd hurt ;)
> >
> > This is what enterprise applications like SAP, MS-SQL, Seastar, etc
> > all do to extract maximum performance out of the underlying storage.
> > They can have hundreds, even thousands of O_DSYNC AIO+DIO in flight at
> > once, generally limited only by software and hardware queue depths.
> >
> > So, yeah, if "general use" is the goal, this better not be a DOS
> > vector. ;)
>
> Just bumping this thread because it has been over 3 years now and
> dm-thinp hasn't seen any changes to address its heavy handling of
> REQ_FUA.
>
> Mike
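
[Aside: for completeness, the analogous per-device check Dave suggests for
dm-thin might look something like the sketch below.  None of these names
exist in the current dm-thin code, and the real mapping, locking and commit
machinery is elided; it only shows the shape of the fast-path decision.]

#include <stdbool.h>

/*
 * Illustrative sketch only -- not dm-thin code.  The idea from the thread:
 * a REQ_FUA write to a block that is already provisioned and not shared
 * with a snapshot, on a thin device with no uncommitted metadata, could be
 * remapped and passed straight down instead of forcing a global pool
 * metadata commit.
 */

struct thin_dev_sketch {
	bool metadata_dirty;   /* uncommitted mapping changes for this device */
};

struct block_map_sketch {
	bool provisioned;      /* block already allocated in the data device */
	bool shared;           /* shared with a snapshot; a write must break sharing */
};

enum fua_path {
	FUA_PASSTHROUGH,       /* remap and pass REQ_FUA down, no pool commit */
	FUA_GLOBAL_COMMIT,     /* existing slow path: commit pool metadata first */
};

static enum fua_path thin_fua_path(const struct thin_dev_sketch *td,
				   const struct block_map_sketch *map)
{
	/* Allocation, COW, or dirty per-device metadata: keep the global commit. */
	if (!map->provisioned || map->shared || td->metadata_dirty)
		return FUA_GLOBAL_COMMIT;

	/* Pure overwrite of an already-mapped block: let the FUA through. */
	return FUA_PASSTHROUGH;
}

Whether the per-device dirtiness and block-sharing state can be queried
cheaply enough inside the shared pool btree is presumably the hard part,
and is what the thread leaves open for dm-thinp.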