On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> Hi all,
>
> I know there's a lot of discussion on the list right now, but I'd like to
> get this out before too much time gets away. I would like to propose the
> topic of atomic writes. I realize the topic has been discussed before, but
> I have not found much activity for it recently, so perhaps we can revisit
> it. We do have a customer who may have an interest, so I would like to
> discuss the current state of things and how we can move forward: whether
> efforts are in progress, and if not, what we have learned from past
> attempts.
>
> I also understand there are multiple ways to solve this problem that
> people may have opinions on. I've noticed some older patch sets trying to
> use a flag to control when dirty pages are flushed, though I think our
> customer would like to see a hardware solution via NVMe devices. So I
> would like to see if others have similar interests as well and what their
> thoughts may be. Thanks, everyone!

Hmmm, well, there are a number of different ways one could do this:

1) Userspace allocates an O_TMPFILE file, clones all the file data to it,
makes whatever changes it wants (thus invoking COW writes), and then calls
some ioctl to swap the differing extent maps atomically. For XFS we have
most of those pieces, but we'd have to add a log intent item to track the
progress of the remap so that we can complete the remap if the system goes
down. This has potentially the best flexibility (multiple processes can
coordinate to stage multiple updates to non-overlapping ranges of the file)
but is also a nice foot bazooka.

2) Set O_ATOMIC on the file, ensure that all writes are staged via COW, and
defer the COW remap step until we hit the synchronization point. When that
happens, we persist the new mappings somewhere (e.g. well beyond all
possible EOF in the XFS case) and then start an atomic remap operation to
move the new blocks into place in the file.
(XFS would still have to add a new log intent item here to finish the
remapping if the system goes down.) Less foot bazooka, but it leaves
lingering questions, like: what do you do if multiple processes want to run
their own atomic updates? (Note that I think you'd need some sort of higher
level progress tracking of the remap operation, because we can't leave a
torn write behind just because the computer crashed.)

3) A magic pwritev2 API that lets userspace talk directly to hardware
atomic writes, though I don't know how userspace discovers what the
hardware limits are. I'm assuming the usual sysfs knobs?

Note that #1 and #2 are done entirely in software, which makes them less
performant, but OTOH there's effectively no limit (besides available
physical space) on how much data or how many non-contiguous extents we can
stage and commit.

--D

> Allison