On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> Hi all,
>
> I know there's a lot of discussion on the list right now, but I'd like to
> get this out before too much time gets away. I would like to propose the
> topic of atomic writes. I realize the topic has been discussed before, but
> I have not found much activity for it recently, so perhaps we can revisit
> it. We do have a customer who may have an interest, so I would like to
> discuss the current state of things and how we can move forward: whether
> efforts are in progress, and if not, what we have learned from past
> attempts.
>
> I also understand there are multiple ways to solve this problem that
> people may have opinions on. I've noticed some older patch sets trying to
> use a flag to control when dirty pages are flushed, though I think our
> customer would like to see a hardware solution via NVMe devices. So I
> would like to see if others have similar interests as well and what their
> thoughts may be. Thanks, everyone!

Hmmm, well, there are a number of different ways one could do this:

1) Userspace allocates an O_TMPFILE file, clones all the file data to it,
makes whatever changes it wants (thus invoking COW writes), and then calls
some ioctl to swap the differing extent maps atomically. For XFS we have
most of those pieces, but we'd have to add a log intent item to track the
progress of the remap so that we can complete the remap if the system goes
down. This has potentially the best flexibility (multiple processes can
coordinate to stage multiple updates to non-overlapping ranges of the file)
but is also a nice foot bazooka.

2) Set O_ATOMIC on the file, ensure that all writes are staged via COW, and
defer the COW remap step until we hit the synchronization point. When that
happens, we persist the new mappings somewhere (e.g. well beyond all
possible EOF in the XFS case) and then start an atomic remap operation to
move the new blocks into place in the file.
(XFS would still have to add a new log intent item here to finish the
remapping if the system goes down.) Less foot bazooka, but it leaves
lingering questions, like: what do you do if multiple processes want to run
their own atomic updates? (Note that I think you'd need some sort of higher
level progress tracking of the remap operation, because we can't leave a
torn write behind just because the computer crashed.)

3) A magic pwritev2 API that lets userspace talk directly to hardware
atomic writes, though I don't know how userspace discovers what the
hardware limits are. I'm assuming the usual sysfs knobs?

Note that #1 and #2 are done entirely in software, which makes them less
performant, but OTOH there's effectively no limit (besides available
physical space) on how much data or how many non-contiguous extents we can
stage and commit.

--D

> Allison