The idea of using O_TMPFILE is interesting ... but opening an
O_TMPFILE is awkward for network file systems because it is not an
atomic operation either ... (create/close then open)

On Thu, Feb 13, 2020 at 10:43 PM Darrick J. Wong
<darrick.wong@xxxxxxxxxx> wrote:
>
> On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> > Hi all,
> >
> > I know there's a lot of discussion on the list right now, but I'd like
> > to get this out before too much time gets away. I would like to propose
> > the topic of atomic writes. I realize the topic has been discussed
> > before, but I have not found much activity on it recently, so perhaps
> > we can revisit it. We do have a customer who may have an interest, so I
> > would like to discuss the current state of things and how we can move
> > forward: whether efforts are in progress, and if not, what we have
> > learned from past attempts.
> >
> > I also understand there are multiple ways to solve this problem that
> > people may have opinions on. I've noticed some older patch sets trying
> > to use a flag to control when dirty pages are flushed, though I think
> > our customer would like to see a hardware solution via NVMe devices. So
> > I would like to see if others have similar interests as well and what
> > their thoughts may be. Thanks everyone!
>
> Hmmm, well, there are a number of different ways one could do this:
>
> 1) Userspace allocates an O_TMPFILE file, clones all the file data to
> it, makes whatever changes it wants (thus invoking COW writes), and then
> calls some ioctl to swap the differing extent maps atomically. For XFS
> we have most of those pieces, but we'd have to add a log intent item to
> track the progress of the remap so that we can complete the remap if the
> system goes down. This has potentially the best flexibility (multiple
> processes can coordinate to stage multiple updates to non-overlapping
> ranges of the file) but is also a nice foot bazooka.
>
> 2) Set O_ATOMIC on the file, ensure that all writes are staged via COW,
> and defer the COW remap step until we hit the synchronization point.
> When that happens, we persist the new mappings somewhere (e.g. well
> beyond all possible EOF in the XFS case) and then start an atomic remap
> operation to move the new blocks into place in the file. (XFS would
> still have to add a new log intent item here to finish the remapping if
> the system goes down.) Less of a foot bazooka, but it leaves lingering
> questions, like: what do you do if multiple processes want to run their
> own atomic updates?
>
> (Note that I think you need some sort of higher-level progress tracking
> of the remap operation, because we can't leave a torn write just because
> the computer crashed.)
>
> 3) A magic pwritev2 API that lets userspace talk directly to hardware
> atomic writes, though I don't know how userspace discovers what the
> hardware limits are. I'm assuming the usual sysfs knobs?
>
> Note that #1 and #2 are done entirely in software, which makes them less
> performant, but OTOH there's effectively no limit (besides available
> physical space) on how much data or how many non-contiguous extents we
> can stage and commit.
>
> --D
>
> > Allison

--
Thanks,

Steve
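
To make option 1 concrete, here is a minimal userspace sketch of the
staging flow, under stated assumptions: FICLONE is the real reflink
ioctl, but XFS_IOC_SWAP_ATOMIC is a purely hypothetical name for the
proposed atomic extent-swap ioctl (no such call exists upstream), and
the paths are placeholders.

/*
 * Option 1 sketch: stage an atomic update through an O_TMPFILE clone.
 * FICLONE is real; XFS_IOC_SWAP_ATOMIC is hypothetical.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE */

int main(void)
{
    int src = open("/mnt/xfs/data", O_RDWR);
    /* Anonymous temporary file in the same filesystem. */
    int tmp = open("/mnt/xfs", O_TMPFILE | O_RDWR, 0600);

    if (src < 0 || tmp < 0) {
        perror("open");
        return 1;
    }

    /* Share all of src's blocks with tmp (reflink clone). */
    if (ioctl(tmp, FICLONE, src)) {
        perror("FICLONE");
        return 1;
    }

    /* Modify tmp; each write COWs only the extents it touches. */
    const char buf[] = "updated record";
    if (pwrite(tmp, buf, sizeof(buf) - 1, 4096) < 0 || fsync(tmp)) {
        perror("stage update");
        return 1;
    }

    /*
     * Hypothetical final step: atomically swap the extents that now
     * differ between tmp and src.  Nothing like this exists upstream;
     * it only illustrates the shape of the proposal.
     */
#ifdef XFS_IOC_SWAP_ATOMIC
    if (ioctl(src, XFS_IOC_SWAP_ATOMIC, tmp)) {
        perror("atomic swap");
        return 1;
    }
#endif

    close(tmp);
    close(src);
    return 0;
}

This sequence also illustrates Steve's objection: a network filesystem
sees several separate operations here (create, clone, write, swap), no
one of which is atomic on the wire by itself.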
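Option 2 implies a programming model something like the following.
Everything here is an assumption: O_ATOMIC has never been merged, and
treating fsync(2) as the synchronization point is only one plausible
choice; the #ifdef guard just keeps the sketch compilable today.

/* Hypothetical O_ATOMIC model; nothing here exists upstream. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int update_record(const char *path, const void *buf, size_t len,
                  off_t off)
{
#ifdef O_ATOMIC                 /* hypothetical flag, never merged */
    int fd = open(path, O_RDWR | O_ATOMIC);
    if (fd < 0)
        return -1;

    /* Writes are staged via COW; readers still see the old data. */
    if (pwrite(fd, buf, len, off) < 0)
        goto fail;

    /* Sync point: atomically remap all staged extents into place. */
    if (fsync(fd))
        goto fail;

    return close(fd);
fail:
    close(fd);
    return -1;
#else
    (void)path; (void)buf; (void)len; (void)off;
    return -1;                  /* no O_ATOMIC on this system */
#endif
}

The open question Darrick raises is visible in the signature: the
staged state belongs to the file, not to the caller, so two processes
running update_record() concurrently would trample each other's
pending remaps.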
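And option 3 might look like this from userspace. pwritev2(2) and its
RWF_* flags are real; RWF_ATOMIC is a hypothetical flag standing in
for whatever the "magic pwritev2 API" would be, and discovering the
hardware limits through sysfs (e.g. somewhere under
/sys/block/<dev>/queue/) is likewise a guess.

/*
 * Option 3 sketch: per-I/O hardware atomic write via pwritev2().
 * RWF_ATOMIC is hypothetical; the #ifdef keeps this compilable.
 */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <stdio.h>

ssize_t write_atomically(int fd, void *buf, size_t len, off_t off)
{
#ifdef RWF_ATOMIC               /* hypothetical per-write flag */
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    /* The device commits all len bytes to media, or none of them. */
    return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
#else
    (void)fd; (void)buf; (void)len; (void)off;
    fprintf(stderr, "no hardware atomic write support\n");
    return -1;
#endif
}

Unlike options 1 and 2, the size of one atomic update here is bounded
by whatever the device advertises, which is why discovery of the
hardware limits matters to the API design.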