On Mon, Aug 14, 2017 at 04:56:53PM -0700, Darrick J. Wong wrote: > The biggest complaint I have about this patch series is that we end up > using separate transactions for each phase of the link/unlink > operations. There's no guarantee that once we start the process of > expand-dir, add-name-to-dir, log-inode-cores, expand-attr, > add-pptr-to-attrs that we actually get all the way to the end of that > process, particularly if we crash in the middle and have to recover the > log coming back up. Yes, that's a problem that we had not been solved at the point in time I wrote that patchset. After making the parent pointer attributes be added and removed correctly, then next thing to do was to make them atomic and replayable. And that's the bit I never got to. > Obviously, back in 2014 when Dave started on these patches there was no > defer_ops, so a lot of effort goes into passing parameters up and down > the stack and fiddling with the transaction reservation type > declarations in order to guarantee that we can shove everything into a > single transaction. Now that we have a more general mechanism to split > complex updates into a chain of smaller updates (and make log recovery > pick up where we left off) it's time to figure out how to make attr (and > maybe dir) updates a deferrable operation. Right. We've got the infrastructure now, but it's nasty to retrofit to all the attr code. > This adds complexity to the log because we have to define an "attr add > KEY:VALUE" and a "attr delete KEY" deferred op that hooks into the > existing attr code. For parent pointers it's not a big deal to log > values since they're not big, but for the general case this idea needs > more thought, or a decision that we'll just keep doing things the same > way. The problem with this is that it requires us to log the entire attribute value, which could be 64k in size. That used to be a big issue because log bandwidth was at a premium. That isn't so much a problem now, so logging the attr value and getting rid of the unlogged sync write for remote attr updates is the way I think we should head. > Anyway, assuming the existence of a "attr delete KEY" log intent item we > no longer have to pass the offsets around everywhere because instead > _dir_removename attaches the defer item to the defer_ops and > _defer_finish takes care of rolling through the updates. > > Thoughts? Yup, it will simplify some things. However, if we're going to start converting complex operations into smaller, simpler ops chained via intents and deferred ops, lets let's take a step back here and think about how we'd like them to be defined, maintained and executed. If we're going to take a big step to convert the dir/attr code to run from deferred ops, I want to make sure we can make the most of it in future. FYI, what I'd really like to do is to be able to statically define the individual operations needed to make a modification as a set of deferred ops sorta like the way we build up transaction reservations. I'm thinking of defer ops being somewhat akin to a state machine in definition and implementation. i.e. rather than being opaque things that are built up by dynamic events, they become mostly static ops with some dynamically driven events. Hence a directory addition is simply a case of selecting the correct defer_ops structure and starting it to run on the directory inode and new dirent name. The high level code would need to check what directory features are enabled to select the ops that need to run, but otherwise there's nothing more to do but hand it to the xfs_deferops_start(). Parent pointers then just becomes a different set of deferops. If possible, it'd be nice to use the definition of each op to determine the log/block reservations needed for the op to run, too. Expanding this filesystem wide, we'd end up with an engine that runs defer ops in the most efficient manner possible and a bunch of smaller, simpler operations linked by intent/done log entries. These ops could be run sync or async by the engine, which would give us a path towards async metadata ops and bulk metadata update optimisations... Yeah, I'm getting way ahead of myself here, but I think you get the idea. Does that even remotely sound plausible, or have I just been hitting the hard stuff too much recently? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html