The idea of using O_TMPFILE is interesting ... but opening an
O_TMPFILE is awkward for network file systems because it is not an
atomic operation either ... (create/close then open)

On Thu, Feb 13, 2020 at 10:43 PM Darrick J. Wong
<darrick.wong@xxxxxxxxxx> wrote:
>
> On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> > Hi all,
> >
> > I know there's a lot of discussion on the list right now, but I'd like
> > to get this out before too much time gets away. I would like to propose
> > the topic of atomic writes. I realize the topic has been discussed
> > before, but I have not found much activity on it recently, so perhaps
> > we can revisit it. We do have a customer who may have an interest, so I
> > would like to discuss the current state of things and how we can move
> > forward: whether efforts are in progress, and if not, what we have
> > learned from past attempts.
> >
> > I also understand there are multiple ways to solve this problem that
> > people may have opinions on. I've noticed some older patch sets trying
> > to use a flag to control when dirty pages are flushed, though I think
> > our customer would like to see a hardware solution via NVMe devices. So
> > I would like to see if others have similar interests as well and what
> > their thoughts may be. Thanks everyone!
>
> Hmmm, well, there are a number of different ways one could do this:
>
> 1) Userspace allocates an O_TMPFILE file, clones all the file data to
> it, makes whatever changes it wants (thus invoking COW writes), and then
> calls some ioctl to swap the differing extent maps atomically. For XFS
> we have most of those pieces, but we'd have to add a log intent item to
> track the progress of the remap so that we can complete the remap if the
> system goes down. This has potentially the best flexibility (multiple
> processes can coordinate to stage multiple updates to non-overlapping
> ranges of the file) but is also a nice foot bazooka.
>
> 2) Set O_ATOMIC on the file, ensure that all writes are staged via COW,
> and defer the COW remap step until we hit the synchronization point.
> When that happens, we persist the new mappings somewhere (e.g. well
> beyond all possible EOF in the XFS case) and then start an atomic remap
> operation to move the new blocks into place in the file. (XFS would
> still have to add a new log intent item here to finish the remapping if
> the system goes down.) Less of a foot bazooka, but it leaves lingering
> questions, like: what do you do if multiple processes want to run their
> own atomic updates?
>
> (Note that I think you need some sort of higher-level progress tracking
> of the remap operation, because we can't leave a torn write just because
> the computer crashed.)
>
> 3) A magic pwritev2 API that lets userspace talk directly to hardware
> atomic writes, though I don't know how userspace discovers what the
> hardware limits are. I'm assuming the usual sysfs knobs?
>
> Note that #1 and #2 are done entirely in software, which makes them less
> performant, but OTOH there's effectively no limit (besides available
> physical space) on how much data or how many non-contiguous extents we
> can stage and commit.
>
> --D
>
> > Allison

--
Thanks,

Steve
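
To make option 1 concrete, here is a minimal userspace sketch of the
staging flow, under stated assumptions: FICLONE is the real reflink
ioctl, but XFS_IOC_SWAP_ATOMIC is a purely hypothetical name for the
proposed atomic extent-swap ioctl (no such call exists upstream), and
the paths are placeholders.

/*
 * Option 1 sketch: stage an atomic update through an O_TMPFILE clone.
 * FICLONE is real; XFS_IOC_SWAP_ATOMIC is hypothetical.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE */

int main(void)
{
    int src = open("/mnt/xfs/data", O_RDWR);
    /* Anonymous temporary file in the same filesystem. */
    int tmp = open("/mnt/xfs", O_TMPFILE | O_RDWR, 0600);

    if (src < 0 || tmp < 0) {
        perror("open");
        return 1;
    }

    /* Share all of src's blocks with tmp (reflink clone). */
    if (ioctl(tmp, FICLONE, src)) {
        perror("FICLONE");
        return 1;
    }

    /* Modify tmp; each write COWs only the extents it touches. */
    const char buf[] = "updated record";
    if (pwrite(tmp, buf, sizeof(buf) - 1, 4096) < 0 || fsync(tmp)) {
        perror("stage update");
        return 1;
    }

    /*
     * Hypothetical final step: atomically swap the extents that now
     * differ between tmp and src.  Nothing like this exists upstream;
     * it only illustrates the shape of the proposal.
     */
#ifdef XFS_IOC_SWAP_ATOMIC
    if (ioctl(src, XFS_IOC_SWAP_ATOMIC, tmp)) {
        perror("atomic swap");
        return 1;
    }
#endif

    close(tmp);
    close(src);
    return 0;
}

This sequence also illustrates Steve's objection: a network filesystem
sees several separate operations here (create, clone, write, swap), no
one of which is atomic on the wire by itself.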
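Option 2 implies a programming model something like the following.
Everything here is an assumption: O_ATOMIC has never been merged, and
treating fsync(2) as the synchronization point is only one plausible
choice; the #ifdef guard just keeps the sketch compilable today.

/* Hypothetical O_ATOMIC model; nothing here exists upstream. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int update_record(const char *path, const void *buf, size_t len,
                  off_t off)
{
#ifdef O_ATOMIC                 /* hypothetical flag, never merged */
    int fd = open(path, O_RDWR | O_ATOMIC);
    if (fd < 0)
        return -1;

    /* Writes are staged via COW; readers still see the old data. */
    if (pwrite(fd, buf, len, off) < 0)
        goto fail;

    /* Sync point: atomically remap all staged extents into place. */
    if (fsync(fd))
        goto fail;

    return close(fd);
fail:
    close(fd);
    return -1;
#else
    (void)path; (void)buf; (void)len; (void)off;
    return -1;                  /* no O_ATOMIC on this system */
#endif
}

The open question Darrick raises is visible in the signature: the
staged state belongs to the file, not to the caller, so two processes
running update_record() concurrently would trample each other's
pending remaps.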
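And option 3 might look like this from userspace. pwritev2(2) and its
RWF_* flags are real; RWF_ATOMIC is a hypothetical flag standing in
for whatever the "magic pwritev2 API" would be, and discovering the
hardware limits through sysfs (e.g. somewhere under
/sys/block/<dev>/queue/) is likewise a guess.

/*
 * Option 3 sketch: per-I/O hardware atomic write via pwritev2().
 * RWF_ATOMIC is hypothetical; the #ifdef keeps this compilable.
 */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <stdio.h>

ssize_t write_atomically(int fd, void *buf, size_t len, off_t off)
{
#ifdef RWF_ATOMIC               /* hypothetical per-write flag */
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    /* The device commits all len bytes to media, or none of them. */
    return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
#else
    (void)fd; (void)buf; (void)len; (void)off;
    fprintf(stderr, "no hardware atomic write support\n");
    return -1;
#endif
}

Unlike options 1 and 2, the size of one atomic update here is bounded
by whatever the device advertises, which is why discovery of the
hardware limits matters to the API design.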