Re: write atomicity with xfs ... current status?

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 17 Mar 2020 10:32:40 +1100

On Mon, Mar 16, 2020 at 02:59:13PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 08:59:54PM +0000, Ober, Frank wrote:
> > Hi, Intel is looking into does it make sense to take an existing,
> > popular filesystem and patch it for write atomicity at the sector
> > count level. Meaning we would protect a configured number of sectors
> > using parameters that each layer in the kernel would synchronize on.
> >  We could use a parameter(s) for this that comes from the NVMe
> > specification such as awun or awunpf
> 
> <gesundheit>
> 
> Oh, that was an acronym...
> 
> > that set across the (affected)
> > layers to a user space program such as innodb/MySQL which would
> > benefit as would other software. The MySQL target is a strong use
> > case, as its InnoDB has a double write buffer that could be removed if
> > write atomicity was protected at 16KiB for the file opens and with
> > fsync(). 
> 
> We probably need a better elaboration of the exact usecases of atomic
> writes since I haven't been to LSF in a couple of years (and probably
> not this year either).  I can think of a couple of access modes off the
> top of my head:
> 
> 1) atomic directio write where either you stay under the hardware atomic
> write limit and we use it, or...

We've plumbed RWF_DSYNC to use REQ_FUA IO for pure overwrites if the
hardware supports it. We can do exactly the same thing for
RWF_ATOMIC - it succeeds if:

- we can issue it as a single bio
- the lower layers can take the entire atomic bio without splitting
  it.
- we treat O_ATOMIC as O_DSYNC so that any metadata changes required
  also get synced to disk before signalling IO completion. If no
  metadata updates are required, then it's an open question as to
  whether REQ_FUA is also required with REQ_ATOMIC...

Anything else returns an "atomic write IO not possible" error.

> 2) software atomic writes where we use the xfs copy-on-write mechanism
> to stage the new blocks and later map them back into the inode, where
> "later" is either an explicit fsync or an O_SYNC write or something...

That's a possible fallback, but we can't guarantee that the write
will be atomic - partial write failure can still occur as page cache
writeback can be split into arbitrary IOs and transactions....

> 3) ...or a totally separate interface where userspace does something
> along the lines of:
> 
> 	write_fd = stage_writes(fd);
> 
> which creates an O_TMPFILE and reflinks all of fd's content to it
> 
> 	write(write_fd...);
> 
> 	err = commit_writes(write_fd, fd);
> 
> which then uses extent remapping to push all the changed blocks back to
> the original file if it hasn't changed.  Bonus: other threads don't see
> the new data until commit_writes() finishes, and we can introduce new
> log items to make sure that once we start committing we can finish it
> even if the system goes down.

Which is essentially userspace library code that runs multiple
syscalls to do the necessary work. commit_writes() is basically a
ranged swap-extents call. i.e.:

	write_fd = open(O_TMPFILE)
	clone_file_range(fd, write_fd, /* overwrite range */)
	loop (overwrite range) {
		write(write_fd)
	}
	fsync(write_fd)
	swap_extents(fd, write_fd, /* overwrite range */)
	fsync(fd)

i.e. this is basically the same process as a partial file defrag
operation. Hence I don't think the kernel needs to be involved in
the software emulation of atomic writes at all. IOWs, if the kernel
returns a "cannot do an atomic write" error to RWF_ATOMIC,
userspace can simply do the slow atomic overwrite as per above
without needing any special kernel code at all...
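
A runnable approximation of that slow path, under loudly stated
assumptions: there is no portable ranged swap-extents call to lean on,
so this sketch swaps in a different commit step, rename(), which
replaces the whole file rather than remapping a range into the
original inode. That means existing open file descriptors keep seeing
the old data, and ownership and mode are not carried over. It also
uses a named staging file instead of O_TMPFILE so that rename() has
something to work with. The function names and the ".stage" suffix are
hypothetical. FICLONE is the real whole-file reflink ioctl; where the
filesystem does not support it, the sketch falls back to a byte copy.

```python
import fcntl
import os

FICLONE = 0x40049409  # _IOW(0x94, 9, int), from <linux/fs.h>


def _pwrite_all(fd, data, offset):
    """Call pwrite() until every byte of data has been written."""
    done = 0
    while done < len(data):
        done += os.pwrite(fd, data[done:], offset + done)


def atomic_overwrite(path, offset, data):
    """Overwrite part of path via a staged copy committed with rename()."""
    tmp_path = path + ".stage"  # hypothetical staging file name
    src = os.open(path, os.O_RDONLY)
    tmp = os.open(tmp_path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        try:
            # Reflink: share all of src's blocks with the staging file.
            fcntl.ioctl(tmp, FICLONE, src)
        except OSError:
            # No reflink support here: fall back to a plain byte copy.
            pos = 0
            while True:
                chunk = os.pread(src, 1 << 20, pos)
                if not chunk:
                    break
                _pwrite_all(tmp, chunk, pos)
                pos += len(chunk)
        _pwrite_all(tmp, data, offset)  # the overwrite itself
        os.fsync(tmp)                   # staged data is stable on disk
    except BaseException:
        os.unlink(tmp_path)             # do not leave a stale staging file
        raise
    finally:
        os.close(tmp)
        os.close(src)
    # Commit: readers opening path see the old or new contents, not a mix.
    os.rename(tmp_path, path)
    # Make the rename itself durable by syncing the parent directory.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

The shape matches the sequence above (clone, write, fsync, commit,
fsync); only the commit primitive differs.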

> > My question is why hasn't xfs write atomicity advanced further, as it
> > seems in 3.x kernel time a few years ago this was tried but nothing
> > committed. as documented here:
> >
> >                http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC
> > 
> > Is xfs write atomicity still being pursued , and with what design
> > objective. There is a long thread here,
> > https://lwn.net/Articles/789600/ on write atomicity, but with no
> > progress, lots of ideas in there but not any progress, but I am
> > unclear.
> > 
> > Is my design idea above simply too simplistic, to try and protect a
> > configured block size (sector count) through the filesystem and block
> > layers, and what really is not making it attainable?
> 
> Lack of developer time, AFAICT.

There are multiple other things, I think:

1. no hardware that provides usable atomic write semantics.
2. no device or block layer support for atomic write IOs; we need
   IO level infrastructure before the filesystems can do anything
   useful
3. no support in page cache for tracking atomic write ranges, so
   atomic writes via buffered IO are rather difficult without using
   temporary files and extent swapping tricks...
4. emulation in userspace is easy if you have clone_file_range()
   support, even if it is slow. We aren't hearing from app
   developers who emulate atomic writes and want kernel side
   acceleration, because the emulation won't work on ext4.

Once we get 1. and 2., then we can support atomic direct IO writes
through XFS via RWF_ATOMIC with relative ease. 4. probably requires
some mods to XFS's swap_extent function to properly support file
ranges. The API supports ranges, the implementation only supports
"full file range"...

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx