On Sat, Aug 17, 2019 at 09:20:06AM -0400, Brian Foster wrote:
> On Sat, Aug 17, 2019 at 11:40:23AM +1000, Dave Chinner wrote:
> > I like this patch because it means we are starting to reach the
> > end-game of this architectural change. This patch indicates that
> > people are starting to understand the end goal of this work: to
> > break up big transactions into atomic chains of smaller, simpler
> > linked transactions. And they are doing so without needing to be
> > explicitly told "this is how we want complex modifications to be
> > done". This is _really good_. :)
> >
> > And that leads me to start thinking about the next step after that,
> > which is what I'd always planned it to be: async processing of
> > the "atomic multi-transaction operations". That, at the time, was
> > based on the observation that we had supercomputers with thousands
> > of CPUs banging on the one filesystem and we always had CPUs to
> > spare. That's even more true these days: lots of filesystem
> > operations are still single threaded, so we have huge amounts of
> > idle CPU to spare. We could be using that to speed up things like
> > rsync, tarball extraction, rm -rf, etc.
>
> I haven't read back through the links yet, but on a skim the "async"
> part of this sounds like a gap in what is described in the sections
> referenced above (which sounds more like changing log formats to
> something more logical than physical). I'm pretty familiar with all of
> the dfops bits to this point, the async bit is what I'm asking about...
>
> What exactly are you thinking about making async that isn't already? Are
> you talking about separating in-core changes from backend
> ordering/logging in general and across the board?

Yup, separating the work we have to do from the process context that
needs it to be done.

Think about a buffered write. All we need to do in process context is
reserve space and copy the data into the kernel. The rest of it is
done asynchronously in the background, and can be expedited by
fsync().

Basically we'd be applying that model to create, rename, etc. It's
more complex because we have to guarantee ordering of operations, but
fundamentally there is nothing stopping us from doing something like
this on create - here's a synchronous create, but with async
transaction processing:

    DEFINE_WAIT(wait);

    trans alloc
    lock dir inode
    log intent {
        dir = dp
        op = file create
        name = <xfs_name>
        mode = mode
        wait = wait
    }
    xfs_defer_finish(intent, wait)
        -> commits intent
        -> punts rest of work to worker thread
        -> when all is done, will wakeup(wait)
        -> sleeps on wait
    unlock dir

This could eventually become an async create by restructuring it
kinda like this:

    ip = xfs_inode_alloc();
    <initialise and set up inode, leave XFS_INEW/I_NEW set>

    grab dir sequence number
    trans alloc
    log intent {
        dir = dp
        seq = dir_seq
        op = file create
        name = <xfs_name>
        mode = mode
        ip = ip
    }
    xfs_defer_finish(intent)
        -> commits intent
        -> punts rest of creation work to worker thread
           when complete, will clear XFS_INEW/I_NEW

    return instantiated inode to caller

Anyone who looks this inode up after creation will block on the
XFS_INEW/I_NEW flag bits. The caller that created the inode will be
able to operate on it straight away....

So converting to async processing really requires several steps:

1. convert everything to intent logging and defer operations
2. start every modification with an intent and commit
3. add wait events to each dfops chain
4. run dfops in worker threads, calling wakeups when done
5. convert high level code to do in-core modifications, dfops runs
   on-disk transactions only
6. get rid of high level waits for ops that don't need to wait for
   transactional changes
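To make that front/back split concrete, here's a minimal userspace
sketch of the pattern - not XFS code, just pthreads standing in for
the logged intent, the dfops worker and the wait event, with all of
the names made up for illustration. The sync and async cases differ
only in whether the submitting context sleeps on the wait before
returning:

    /* Toy model of "log intent, punt work, optionally wait". */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct intent {
        const char      *op;    /* e.g. "file create" */
        const char      *name;
        bool            done;   /* set by the worker when finished */
        pthread_mutex_t lock;
        pthread_cond_t  wait;   /* the "wait event" in the dfops chain */
    };

    /* Back end: the transactional work no longer done in process context. */
    static void *worker(void *arg)
    {
        struct intent *in = arg;

        printf("worker: processing intent %s(%s)\n", in->op, in->name);
        /* ...allocate inode, add dirent, etc... */

        pthread_mutex_lock(&in->lock);
        in->done = true;                        /* like clearing XFS_INEW/I_NEW */
        pthread_cond_broadcast(&in->wait);      /* wakeup(wait) */
        pthread_mutex_unlock(&in->lock);
        return NULL;
    }

    /* Front end: commit the intent, punt the rest, optionally wait. */
    static void intent_create(struct intent *in, bool sync)
    {
        pthread_t tid;

        pthread_mutex_init(&in->lock, NULL);
        pthread_cond_init(&in->wait, NULL);
        in->done = false;

        pthread_create(&tid, NULL, worker, in); /* punt to worker thread */
        pthread_detach(tid);

        if (!sync)
            return;                 /* async: return with the "new" flag still set */

        pthread_mutex_lock(&in->lock);          /* sync: sleep on wait */
        while (!in->done)
            pthread_cond_wait(&in->wait, &in->lock);
        pthread_mutex_unlock(&in->lock);
    }

    int main(void)
    {
        struct intent in = { .op = "file create", .name = "foo" };

        intent_create(&in, true);       /* synchronous create, async processing */
        printf("front end: create of %s done\n", in.name);
        return 0;
    }

Obviously the kernel version is wait queues, workqueues and log
intents rather than pthreads, but the shape of the front end is the
same: log what needs to be done, hand it off, and only sleep if the
caller actually has to observe the completed operation.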
> Or opportunistically
> making certain deferred operations async if the result of such
> operations is not required to be complete by the time the issuing
> operation returns to userspace?

Well, that's obvious for things like unlink. But what such async
processing allows is things like bulk directory modifications (e.g.
rm -rf detection, because the dir inode gets unlinked before we've
started processing any of the dirent removal ops) which can greatly
speed up operations.

e.g. rm -rf becomes "load all the inodes into memory as we log the
dirent removals; when the dir unlink is logged, truncate the dir
inode because the dirents are all gone; sort all the inodes into same
cluster/chunk groups; free all the inodes in a single inobt/finobt
record update...."

IOWs, moving to intent based logging allows us to dynamically change
the way we do operations - the intent defines what needs to be done,
but it doesn't define how it gets done. As such, bulk processing
optimisations become possible, and those optimisations can be done
completely independently of the front end that logs the initial
intent.

> For example, a hole punch needs to
> modify the associated file before it returns, but we might not care if
> the associated block freeing operation has completed or not before the
> punch returns (as long as the intent is logged) because that's not a
> hard requirement of the higher level operation. Whereas the current
> behavior is that the extent free operation is deferred, but it is not
> necessarily async at the operational level (i.e. the async logging
> nature of the CIL notwithstanding). Hm?

Yup, exactly. Nothing says the extent has to be free by the time the
hole punch returns. The only rules we need to play by are that it
looks to userspace like there's a hole, and that if they run fsync()
then there really is a hole. Otherwise the scheduling of the work is
largely up to us.

Split front/back async processing like this isn't new - it's
something Daniel Phillips was trying to do with tux3. It deferred as
much as it could to the back end processing threads and did as little
as possible in the syscall contexts. See slide 14:

https://events.static.linuxfound.org/sites/events/files/slides/tux3.linuxcon.pdf

So the concept has largely been proven in other filesystems, it's
just that if you don't design something from scratch to be
asynchronous it can be difficult to retrofit...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx