[one more time with feeling, after fixing dm-devel address typo]

On Tue, Apr 23, 2024 at 01:19:39PM -0400, Mike Snitzer wrote:
> [adding snitzer@xxxxxxxxxx and dm-devel to the Cc so I don't lose this
> important thread again]
> [this was old Red Hat internal discussion that I tripped over like a
> dead cat, but there isn't anything sensitive here: dm-thinp needs to
> improve]
>
> On Sun, Jan 24 2021 at  8:34P -0500,
> Dave Chinner <dchinner@xxxxxxxxxx> wrote:
>
> > On Thu, Jan 21, 2021 at 11:25:37PM -0500, Mike Snitzer wrote:
> > > On Thu, Jan 21 2021 at  9:33pm -0500,
> > > Dave Chinner <dchinner@xxxxxxxxxx> wrote:
> > >
> > > > On Thu, Jan 21, 2021 at 04:23:20PM +0000, Joe Thornber wrote:
> > > > > On Thu, 2021-01-21 at 17:45 +1100, Dave Chinner wrote:
> > > > > > On Wed, Jan 20, 2021 at 10:34:01AM +0000, Joe Thornber wrote:
> > > > > >
> > > > > > > - Currently the tools unshare _metadata_ when restoring/repairing
> > > > > > >   pools with snapshots.  Combined with the hard 16gb limit on
> > > > > > >   metadata size this means some systems with many snapshots can't
> > > > > > >   be repaired if they run into trouble.  Next release of the tools
> > > > > > >   will fix, but as said before, we have to assess the software as is.
> > > > > >
> > > > > > Can you explain more about this limit and what it means? The test
> > > > > > I'm currently running (1000 iterations of "write 10,000 4kB IOs,
> > > > > > snapshot"
> > > > >
> > > > > Right let's talk about your scenario.
> > > > >
> > > > > Firstly metadata.  Our transactional model is based on persistent data
> > > > > structures in the functional programming sense.  To update metadata in
> > > > > a new transaction we COW key metadata pages to build a new data
> > > > > structure that shares data with that of the previous transaction.  We
> > > > > amortise the costs associated with this by holding the transactions
> > > > > open as long as possible and allowing real mutation *within* a
> > > > > transaction.
> > > > >
> > > > > There are advantages and disadvantages with this scheme.
> > > > >
> > > > > Advantages: It's simple.  It allows the 'take snapshot' operation to be
> > > > > very quick (just writing a handful of pages).  It also allows us to
> > > > > take a 'metadata snapshot' of all the metadata which userland can
> > > > > access safely.  So we can move the implementation of certain
> > > > > query/health functions entirely to userland (see thin_check, thin_ls,
> > > > > thin_dump etc).
> > > > >
> > > > > Disadvantages: The biggest down side is everything is done in one
> > > > > global transaction; if a commit is required by one thin device (eg, due
> > > > > to a REQ_FUA), then other active thin devices will be temporarily
> > > > > slowed as the transaction commits.
> > > >
> > > > OUCH!
> > > >
> > > > That means every journal IO from the filesystem on the thin vol will
> > > > trigger a globally serialising metadata commit. It also means every
> > > > user O_DSYNC/RWF_DSYNC direct IO (which we optimise to FUA writes
> > > > to avoid journal flushes) will trigger a thinp commit.
> > >
> > > So XFS hijacks REQ_FUA to bypass its journaling _but_ expects that
> > > REQ_FUA to have no impact on the underlying storage?  Why not clear
> > > REQ_FUA after XFS decides to bypass its journal?  Why use REQ_FUA for
> > > this?  What are your assumptions about how the underlying block device
> > > will respond?
> >
> > "Hijack" is a bit ... strong.
> > We replaced a full device cache flush with an FUA write at the request
> > of a partner that demonstrated that using FUA writes in their
> > applications on other operating systems on the same hardware
> > configurations ran them 25-50% faster than on RHEL/XFS.
> >
> > I reproduced their results locally, and the difference in
> > performance was the device level overhead between write+full device
> > cache flush (the traditional linux DIO O_DSYNC implementation) and
> > using FUA writes to avoid the full device cache flush when only the
> > data in the IO itself needed to be written to stable storage.
> >
> > It's not at all controversial - REQ_FUA has been used by filesystems
> > for years for their journal writes to optimise away the post-write
> > cache flush on hardware that supports FUA. And ext4, gfs2, zonefs
> > and btrfs all use this same direct IO code path now, so it's not
> > just XFS doing this...
> >
> > FWIW, I mused on the weekend that we don't even need to check that
> > the block device supports FUA to make this optimisation, because the
> > block layer will convert REQ_FUA to a normal write + post-write
> > cache flush if the underlying device doesn't support FUA. So we are
> > likely to make all pure O_DSYNC data overwrites use REQ_FUA in the
> > future, not just on devices that support FUA...
> >
> > > > This is what enterprise applications do for IO performance - small,
> > > > random, highly concurrent O_DSYNC AIO+DIO overwrites - so it seems
> > > > like this is going to cause IO serialisation problems even if there is
> > > > only one thin-vol in a thin-pool...
> > >
> > > Yes, but I currently cannot say this is thinp's fault. REQ_FUA must be
> > > honored. If dm-thinp is being asked to flush caches and get a
> > > consistent point-in-time it cannot be ignored.
> >
> > Yes, that's the point I was trying to make. FUA is not a cache flush
> > request - it's much more fine grained than that, and so can be used
> > to elide full device cache flushes. Hence requiring global commits
> > to honor FUA data writes defeats the higher level optimisations that
> > try to minimise the device cache flush overhead...
> >
> > > So what additional data structures and approach should we be taking to
> > > mask the impact of needing to keep data and metadata consistent?
> >
> > I'm not a dm-thin expert, but I can describe the constraints for
> > using REQ_FUA because I think they probably apply to dm-thin, too.
> >
> > For filesystems, there are per-inode metadata trees, and that's all
> > we need to be stable w.r.t. O_DSYNC data writes. If the per-inode
> > metadata is up-to-date on disk (i.e. we are doing a pure overwrite
> > and no metadata needs to change to perform the data IO) then we can
> > issue a REQ_FUA write. If there is any metadata that needs to be
> > modified to perform the write (e.g. we need an allocation, convert
> > an unwritten extent, etc) or something else dirtied the inode
> > metadata prior to the write, then we do a normal data write and a
> > post-write journal flush (which is a REQ_PREFLUSH|REQ_FUA write) to
> > ensure both the data and the metadata that references it are on
> > stable storage.
> >
> > See this commit:
> >
> > commit 3460cac1ca76215a60acb086ebe97b3e50731628
> > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > Date:   Wed May 2 12:54:53 2018 -0700
> >
> >     iomap: Use FUA for pure data O_DSYNC DIO writes
> > ....
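
[Aside: to make Dave's pure-overwrite rule concrete, here is a rough,
self-contained sketch.  It is not the kernel's iomap code: the struct,
enum and SKETCH_* names below are made-up stand-ins, and only the
behaviour is taken from the description above.]

/*
 * Illustrative sketch only -- not the kernel's iomap code.  It shows the
 * rule described above: a pure O_DSYNC overwrite of an already-allocated,
 * clean extent can be issued as a REQ_FUA data write, while anything that
 * needs a metadata change (allocation, unwritten conversion, COW, or a
 * previously dirtied inode) falls back to a normal write followed by a
 * REQ_PREFLUSH|REQ_FUA journal flush.
 */

/* hypothetical stand-ins for the real iomap flags and extent state */
#define SKETCH_F_DIRTY     (1u << 0)   /* inode has O_DSYNC-dirty metadata */
#define SKETCH_F_SHARED    (1u << 1)   /* extent is shared, write needs COW */
#define SKETCH_UNWRITTEN   2           /* extent allocated but unwritten */

struct extent_sketch {
	unsigned int flags;
	int type;
};

enum dsync_strategy {
	DSYNC_FUA_WRITE,         /* issue the data write with REQ_FUA */
	DSYNC_WRITE_THEN_FLUSH,  /* normal write + REQ_PREFLUSH|REQ_FUA */
};

static enum dsync_strategy choose_dsync_strategy(const struct extent_sketch *ext)
{
	/* Metadata must change (or is already dirty): data alone isn't enough. */
	if (ext->flags & (SKETCH_F_DIRTY | SKETCH_F_SHARED))
		return DSYNC_WRITE_THEN_FLUSH;
	if (ext->type == SKETCH_UNWRITTEN)
		return DSYNC_WRITE_THEN_FLUSH;

	/*
	 * Pure overwrite with stable metadata: REQ_FUA covers the data, and
	 * the block layer emulates FUA (write + post-write cache flush) on
	 * devices that don't support it natively.
	 */
	return DSYNC_FUA_WRITE;
}

The branch that matters for dm-thin is the last one: the filesystem only
issues a bare REQ_FUA data write when its own metadata is already stable,
so the device below it sees FUA mostly on pure overwrites.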
> >
> > Also, check the current upstream tree for IOMAP_F_DIRTY so the fs
> > can indicate that the inode has O_DSYNC dirty metadata (allocate,
> > unwritten) or IOMAP_F_SHARED (requires COW), and the iomap
> > implementation uses the IOMAP_DIO_WRITE_FUA and IOMAP_DIO_NEED_SYNC
> > flags to determine what to do at IO completion.
> >
> > I suspect that dm-thin would need to do something similar for
> > REQ_FUA writes - if the thin-dev metadata (rather than the global
> > pool metadata) is clean and there's an already allocated block being
> > overwritten, then you can just pass the FUA down and not have
> > to run a global transaction commit. If there's anything dirty, or
> > the IO requires COW or allocation, then it needs to run through the
> > existing slow path...
> >
> > > > > This limits the number of active
> > > > > thin devices. Similarly if different thin devices are performing
> > > > > operations that allocate new data space (provisioning, COW), then there
> > > > > can be contention in the allocators.
> > > >
> > > > Or we are doing concurrent IO to the same thin device...
> > >
> > > Concurrent IO that is causing a denial-of-service due to floods of
> > > REQ_FUA IO? Yeap.. that sounds like it'd hurt ;)
> >
> > This is what enterprise applications like SAP, MS-SQL, Seastar, etc
> > all do to extract maximum performance out of the underlying storage.
> > They can have hundreds, even thousands of O_DSYNC AIO+DIO in flight at
> > once, generally limited only by software and hardware queue depths.
> >
> > So, yeah, if "general use" is the goal, this better not be a DOS
> > vector. ;)
>
> Just bumping this thread because it has been over 3 years now and
> dm-thinp hasn't seen any changes to address its heavy handling of
> REQ_FUA.
>
> Mike
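
[Aside: for completeness, the analogous per-device check Dave suggests for
dm-thin might look something like the sketch below.  None of these names
exist in the current dm-thin code, and the real mapping, locking and commit
machinery is elided; it only shows the shape of the fast-path decision.]

#include <stdbool.h>

/*
 * Illustrative sketch only -- not dm-thin code.  The idea from the thread:
 * a REQ_FUA write to a block that is already provisioned and not shared
 * with a snapshot, on a thin device with no uncommitted metadata, could be
 * remapped and passed straight down instead of forcing a global pool
 * metadata commit.
 */

struct thin_dev_sketch {
	bool metadata_dirty;   /* uncommitted mapping changes for this device */
};

struct block_map_sketch {
	bool provisioned;      /* block already allocated in the data device */
	bool shared;           /* shared with a snapshot; a write must break sharing */
};

enum fua_path {
	FUA_PASSTHROUGH,       /* remap and pass REQ_FUA down, no pool commit */
	FUA_GLOBAL_COMMIT,     /* existing slow path: commit pool metadata first */
};

static enum fua_path thin_fua_path(const struct thin_dev_sketch *td,
				   const struct block_map_sketch *map)
{
	/* Allocation, COW, or dirty per-device metadata: keep the global commit. */
	if (!map->provisioned || map->shared || td->metadata_dirty)
		return FUA_GLOBAL_COMMIT;

	/* Pure overwrite of an already-mapped block: let the FUA through. */
	return FUA_PASSTHROUGH;
}

Whether the per-device dirtiness and block-sharing state can be queried
cheaply enough inside the shared pool btree is presumably the hard part,
and is what the thread leaves open for dm-thinp.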