On Fri, Feb 11, 2022 at 10:08:06AM +1100, Dave Chinner wrote:
> On Thu, Feb 10, 2022 at 02:03:23PM -0500, Brian Foster wrote:
> > On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> > > > On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> > > > > > Of course if you wanted to recycle inactive inodes or do something else
> > > > > > entirely, then it's probably not worth going down this path..
> > > > >
> > > > > I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> > > > > it's less tangling with -fsdevel for you and (b) inode scans in the
> > > > > online repair patchset got a little weird once xfs_iget lost the ability
> > > > > to recycle _NEEDS_INACTIVE inodes...
> > > > >
> > > > > OTOH I tried to figure out how to deal with the lockless list that those
> > > > > inodes are put on, and I couldn't figure out how to get them off the
> > > > > list safely, so that might be a dead end. If you have any ideas I'm all
> > > > > ears. :)
> > > > >
> > > >
> > > > So one of the things that I've been kind of unclear on about the current
> > > > deferred inactivation implementation is the primary goal of the percpu
> > > > optimization. I obviously can see the value of the percpu list in
> > > > general, but how much processing needs to occur in percpu context to
> > > > achieve the primary goal?
> > > >
> > > > For example, I can see how a single or small multi threaded sustained
> > > > file removal might be batched efficiently, but what happens if said task
> > > > happens to bounce around many cpus?
> > >
> > > In that case we have a scheduler problem, not a per-cpu
> > > infrastructure issue.
> >
> > Last I tested on my box, a single threaded rm -rf had executed on
> > something like 24 of the 80 cpus available after about ~1m of runtime.
>
> Not surprising - the scheduler will select sibling cores that
> share caches when the previous CPU the task was running on is busy
> running CPU affine tasks (i.e. the per-cpu inactive kworker thread).
>
> But how many CPUs it bounces the workload around over a long period
> is not important. What is important is the cache residency between
> the inodes when they are queued and when they are then inactivated.
> That's measured in microseconds to single digit milliseconds (i.e.
> within a single scheduling tick), and this is the metric that
> matters the most. It doesn't matter if the first batch is unlinked
> on CPU 1 and then inactivated on CPU 1 while the scheduler moves the
> unlink task to CPU 2 where it queues the next batch to be
> inactivated on CPU 2, and so on. What matters is that the data set in
> each batch remains on the same CPU for inactivation processing.
>
> The tracing I've done indicates that the majority of the time the
> scheduler bounces the tasks between two or three sibling CPUs
> repeatedly. This occurs when the inactivation is keeping up with the
> unlink queueing side. When we have lots of extents to remove in
> inactivation, the amount of inactivation work is much greater than
> the unlink() work, and so we see inactivation batch processing
> taking longer and the CPU spread of the unlink task gets larger
> because there are more CPUs running CPU-affine tasks doing
> background inactivation.
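
For anyone following along who has not dug through the inodegc code:
the affinity being described above comes from queueing and draining on
the same CPU.  A stripped down sketch of that pattern, using only
generic kernel primitives, looks roughly like the below.  Every name in
it is made up; this is not the actual fs/xfs/xfs_icache.c code.

/* Simplified per-cpu deferred-processing sketch; not XFS code. */
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

struct gc_cpu {
	struct llist_head	list;	/* lockless per-cpu queue */
	struct work_struct	work;	/* INIT_WORK(&work, gc_worker) at setup */
};

struct gc_item {
	struct llist_node	node;
	/* ...state needed to inactivate the object... */
};

static DEFINE_PER_CPU(struct gc_cpu, gc_cpu_queue);

/* Called from the context that drops the last reference. */
static void gc_queue(struct gc_item *item)
{
	struct gc_cpu	*gc;
	int		cpu;

	cpu = get_cpu();		/* stay on this CPU while queueing */
	gc = this_cpu_ptr(&gc_cpu_queue);
	llist_add(&item->node, &gc->list);
	/* kick a worker bound to the same CPU so the batch stays cache hot */
	queue_work_on(cpu, system_wq, &gc->work);
	put_cpu();
}

/* Worker: drains whatever batch was queued on this CPU. */
static void gc_worker(struct work_struct *work)
{
	struct gc_cpu		*gc = container_of(work, struct gc_cpu, work);
	struct llist_node	*head = llist_del_all(&gc->list);
	struct gc_item		*item, *next;

	llist_for_each_entry_safe(item, next, head, node) {
		/* expensive teardown runs while the data is still local */
	}
}

The llist_add() and the llist_del_all() for a given batch happen on the
same CPU within a scheduling tick or so, which is where the cache
residency described above comes from.
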
>
> IOWs, having the number of CPUs the unlink task is scheduled on
> grow over the long term is not at all unexpected - this is exactly
> what we'd expect to see when we move towards async background
> processing of complex operations...
>
> > Of course the inactivation work for an inode occurs on the cpu that the
> > rm task was running on when the inode was destroyed, but I don't think
> > there's any guarantee that kworker task will be the next task selected
> > on the cpu if the system is loaded with other runnable tasks (and then
> > whatever task is selected doesn't pollute the cache).
>
> The scheduler will select the next CPU based on physical machine
> topology - core latencies, shared caches, numa distances, whether
> the target CPU has affinity bound tasks already queued, etc.
>
> > For example, I
> > also noticed rm-<pidX> -> rm-<pidY> context switching on concurrent
> > remove tests. That is obviously not a caching issue in this case because
> > both tasks are still doing remove work, but behavior is less
> > deterministic if the target task happens to be something unrelated. Of
> > course, that doesn't mean the approach might not otherwise be effective
> > for the majority of common workloads..
>
> As long as the context switch rate does not substantially increase,
> having tasks migrate between sibling cores every so often isn't a
> huge problem.
>
> > > That per-ag based back end processing is exactly what Darrick's
> > > original proposals did:
> > >
> > > https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/
> > >
> > > It used radix tree walks run in background kworker threads to batch
> > > all the inode inactivation in a given AG via radix tree walks to
> > > find them.
> > >
> > > It performed poorly at scale because it destroyed the CPU affinity
> > > between the unlink() operation and the inactivation operation of the
> > > unlinked inode when the last reference to the inode is dropped and
> > > the inode evicted from task syscall exit context. Regardless of
> > > whether there is a per-cpu front end or not, the per-ag based
> > > processing will destroy the CPU affinity of the data set being
> > > processed because we cannot guarantee that the per-ag objects are
> > > all local to the CPU that is processing the per-ag objects....
> > >
> >
> > Ok. The role/significance of cpu caching was not as apparent to me when
> > I had last replied to this thread. The discussion around the rcu inode
> > lifecycle issue helped clear some of that up.
> >
> > That said, why not conditionally tag and divert to a background worker
> > when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to
> > be claimed/recycled from other contexts in scenarios like when the fs is
> > frozen, since they won't be stuck in inaccessible and inactive percpu
> > queues, but otherwise preserves current behavior in normal runtime
> > conditions. Darrick mentioned online repair wanting to do something
> > similar earlier, but it's not clear to me if scrub could or would want
> > to disable the percpu inodegc workers in favor of a temporary/background
> > mode while repair is running. I'm just guessing that performance is
> > probably a small enough concern in that situation that it wouldn't be
> > a mitigating factor. Hm?
>
> We probably could do this, but I'm not sure the complexity is
> justified by the rarity of the problem it is trying to avoid.
> Freezes are not long term, nor are they particularly common for
> performance sensitive workloads. Hence I'm just not sure this corner
> case is important enough to justify doing the work given that we've
> had similar freeze-will-delay-some-stuff-indefinitely behaviour for a
> long time...
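
For concreteness, here is roughly how I read the "tag and divert when
inodegc is disabled" idea.  Every name in this sketch is hypothetical;
it is not a patch against fs/xfs, just the shape of the branch:

/* Hypothetical sketch of "tag and divert when gc is disabled". */
#include <linux/bitops.h>
#include <linux/compiler.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Minimal stand-ins; not the real xfs_mount/xfs_inode. */
struct my_inode {
	unsigned long		flags;
	struct list_head	deferred_gc_list;
};

struct my_mount {
	bool			inodegc_enabled;
	spinlock_t		deferred_gc_lock;
	struct list_head	deferred_gc;	/* parked NEEDS_INACTIVE inodes */
};

#define MY_INODE_NEEDS_INACTIVE	0

/* Stand-in for today's per-cpu queue-and-kick fast path. */
static void percpu_queue_and_kick(struct my_mount *mp, struct my_inode *ip)
{
	/* llist_add() to the per-cpu list + queue_work_on(), as today */
}

static void inodegc_queue(struct my_mount *mp, struct my_inode *ip)
{
	if (likely(READ_ONCE(mp->inodegc_enabled))) {
		/* normal runtime: keep the cache-hot per-cpu batching */
		percpu_queue_and_kick(mp, ip);
		return;
	}

	/*
	 * gc is disabled (e.g. the fs is frozen): tag the inode and park
	 * it on a globally visible list so that a lookup can still find,
	 * claim and recycle it instead of it being stranded on an
	 * inaccessible per-cpu queue until thaw.
	 */
	spin_lock(&mp->deferred_gc_lock);
	set_bit(MY_INODE_NEEDS_INACTIVE, &ip->flags);
	list_add_tail(&ip->deferred_gc_list, &mp->deferred_gc);
	spin_unlock(&mp->deferred_gc_lock);
	/* a background or thaw-time worker drains mp->deferred_gc later */
}

The common path keeps the per-cpu batching untouched; only the disabled
case pays for a lock and a globally visible list.
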
We /do/ have a few complaints lodged about hangcheck warnings when the
filesystem has to be frozen for a very long time. It'd be nice to
unblock the callers that want to grab a still-reachable inode, though.

As for the online repair case that Brian mentioned, I've largely
eliminated the need for scrub freezes by using jump labels, notifier
chains, and an in-memory shadow btree to make it so that full fs scans
can run in the background while the rest of the fs continues running.
That'll be discussed ... soon, when I get to the point where I'm ready
to start a full design review.

(Hilariously, between that and better management of the DONTCACHE flag,
I don't actually need deferred inactivation for repair anymore... :P)

--D

>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
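
P.S. For anyone who has not seen the jump label + notifier chain hook
pattern mentioned above, the general shape is something like the sketch
below.  None of these names are the real scrub/repair symbols; it is
just the generic kernel mechanism:

/* Generic hook sketch: near-zero cost until a scanner arms it. */
#include <linux/jump_label.h>
#include <linux/notifier.h>

static DEFINE_STATIC_KEY_FALSE(scan_hook_active);
static BLOCKING_NOTIFIER_HEAD(scan_hook_chain);

/* Called from the regular update path; a patched-out branch when no scan runs. */
static inline void scan_hook_update(unsigned long op, void *data)
{
	if (static_branch_unlikely(&scan_hook_active))
		blocking_notifier_call_chain(&scan_hook_chain, op, data);
}

/* Scanner side: arm the hook for the duration of the background scan. */
static void scan_hook_arm(struct notifier_block *nb)
{
	blocking_notifier_chain_register(&scan_hook_chain, nb);
	static_branch_inc(&scan_hook_active);
}

static void scan_hook_disarm(struct notifier_block *nb)
{
	static_branch_dec(&scan_hook_active);
	blocking_notifier_chain_unregister(&scan_hook_chain, nb);
}

The static key keeps the hook essentially free in the common case, and
the notifier only runs while a scan has it armed, which is what lets a
full fs scan keep its shadow copy coherent with live updates instead of
freezing the rest of the fs.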