On Wed, Jan 26, 2022 at 09:36:07AM +1100, Dave Chinner wrote:
> On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote:
> > > > Ok, I think we're talking about slightly different things. What I mean
> > > > above is that if a task removes a file and goes off doing unrelated
> > > > $work, that inode will just sit on the percpu queue indefinitely. That's
> > > > fine, as there's no functional need for us to process it immediately
> > > > unless we're around -ENOSPC thresholds or some such that demand reclaim
> > > > of the inode.
> > >
> > > Yup, an occasional unlink sitting around for a while on an unlinked
> > > list isn't going to cause a performance problem. Indeed, such
> > > workloads are more likely to benefit from the reduced unlink()
> > > syscall overhead and won't even notice the increase in background
> > > CPU overhead for inactivation of those occasional inodes.
> > >
> > > > It sounds like what you're talking about is specifically
> > > > the behavior/performance of sustained file removal (which is important
> > > > obviously), where apparently there is a notable degradation if the
> > > > queues become deep enough to push the inode batches out of CPU cache. So
> > > > that makes sense...
> > >
> > > Yup, sustained bulk throughput is where cache residency really
> > > matters. And for unlink, sustained unlink workloads are quite
> > > common; they often are something people wait for on the command line
> > > or make up a performance-critical component of a highly concurrent
> > > workload, so it's pretty important to get this part right.
> > >
> > > > > Darrick made the original assumption that we could delay
> > > > > inactivation indefinitely and so he allowed really deep queues of up
> > > > > to 64k deferred inactivations. But with queues this deep, we could
> > > > > never get that background inactivation code to perform anywhere near
> > > > > the original synchronous background inactivation code. e.g. I
> > > > > measured 60-70% performance degradations on my scalability tests,
> > > > > and nothing stood out in the profiles until I started looking at
> > > > > CPU data cache misses.
> > > >
> > > > ... but could you elaborate on the scalability tests involved here so I
> > > > can get a better sense of it in practice and perhaps observe the impact
> > > > of changes in this path?
> > >
> > > The same concurrent fsmark create/traverse/unlink workloads I've
> > > been running for the past decade+ demonstrate it pretty simply. I
> > > also saw regressions with dbench (both op latency and throughput) as
> > > the client count (concurrency) increased, and with compilebench. I
> > > didn't look much further because all the common benchmarks I ran
> > > showed perf degradations with arbitrary delays that went away with
> > > the current code we have. ISTR that parts of the aim7/reaim scalability
> > > workloads that the Intel zero-day infrastructure runs are quite
> > > sensitive to background inactivation delays as well, because that's a
> > > CPU-bound workload and hence any reduction in cache residency
> > > results in a reduction of the number of concurrent jobs that can be
> > > run.
> >
> > Curiosity and all that, but has this work produced any intuition on
> > the sensitivity of the performance/scalability to the delays? As in
> > the effect of microseconds vs. tens of microseconds vs. hundreds of
> > microseconds?
>
> Some, yes.
>
> The upper delay threshold where performance is measurably
> impacted is in the order of single digit milliseconds, not
> microseconds.
>
> What I saw was that as the batch processing delay goes beyond ~5ms,
> IPC starts to fall. The CPU usage profile does not change shape, nor
> do the proportions of where CPU time is spent change. All I see is
> data cache misses go up substantially and IPC drop substantially. If
> I read my notes correctly, the typical change from "fast" to "slow"
> in IPC was 0.82 to 0.39 and LLC-load-misses went from 3% to 12%. The
> IPC degradation was all done by the time the background batch
> processing times were longer than a typical scheduler tick (10ms).
>
> Now, I've been testing on Xeon CPUs with 36-76MB of L2-L3 caches, so
> there's a fair amount of data that these can hold. I expect that
> with smaller caches, the inflection point will be at smaller batch
> sizes rather than larger ones. Hence while I could have used larger
> batches for background processing (e.g. 64-128 inodes rather than 32),
> I chose smaller batch sizes by default so that CPUs with smaller
> caches are less likely to be adversely affected by the batch size
> being too large. OTOH, I started to measure noticeable degradation at
> batch sizes of 256 inodes on my machines, which is why the hard
> queue limit got set to 256 inodes.
>
> Scaling the delay/batch size down towards single-inode queuing also
> resulted in perf degradation. This was largely because of all the
> extra scheduling overhead that trying to switch between the user task
> and the kernel worker task for every inode entailed. The context
> switch rate went from a couple of thousand/sec to over 100,000/s for
> single-inode batches, and performance went backwards in proportion
> with the amount of CPU then spent on context switches. It also led
> to increases in buffer lock contention (hence more context switches)
> as both the user task and the kworker try to access the same
> buffers...

Makes sense. Never a guarantee of easy answers. ;-)

If it would help, I could create expedited-grace-period counterparts
of get_state_synchronize_rcu(), start_poll_synchronize_rcu(),
poll_state_synchronize_rcu(), and cond_synchronize_rcu(). These would
provide sub-millisecond grace periods, in fact, sub-100-microsecond
grace periods on smaller systems.

Of course, nothing comes for free. Although expedited grace periods
are way, way cheaper than they used to be, they still IPI non-idle
non-nohz_full-userspace CPUs, which translates to roughly the CPU
overhead of a wakeup on each IPIed CPU. And of course there is some
disruption to aggressive non-nohz_full real-time applications.
Shorter latencies also translate to fewer updates over which to
amortize grace-period overhead. But it should get well under your
single-digit milliseconds of delay.

							Thanx, Paul
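
As a rough illustration of the batching policy Dave describes above, a
minimal sketch of a per-CPU deferral queue follows. It is not the XFS
inodegc implementation; every identifier (defer_pcpu, defer_item(),
defer_worker(), DEFER_BATCH_*) is invented, and only the two thresholds
echo the numbers quoted in the thread (a small default batch, and a
hard cap of 256 at which the producer is throttled and waits for the
worker):

	#include <linux/llist.h>
	#include <linux/workqueue.h>
	#include <linux/percpu.h>
	#include <linux/atomic.h>

	#define DEFER_BATCH_LOW		32	/* kick the background worker */
	#define DEFER_BATCH_MAX		256	/* throttle producers beyond this */

	struct defer_pcpu {
		struct llist_head	queue;
		atomic_t		count;
		struct work_struct	work;
	};

	static DEFINE_PER_CPU(struct defer_pcpu, defer_queues);

	/* Background side: drain and process a whole batch while it is cache hot. */
	static void defer_worker(struct work_struct *work)
	{
		struct defer_pcpu *q = container_of(work, struct defer_pcpu, work);
		struct llist_node *batch = llist_del_all(&q->queue);
		struct llist_node *pos, *n;
		int nr = 0;

		llist_for_each_safe(pos, n, batch) {
			/* ... inactivate the inode embedding this llist_node ... */
			nr++;
		}
		atomic_sub(nr, &q->count);
	}

	/* Producer side: defer the work on the local CPU's queue. */
	static void defer_item(struct llist_node *item)
	{
		struct defer_pcpu *q = get_cpu_ptr(&defer_queues);
		bool throttle;

		llist_add(item, &q->queue);
		if (atomic_inc_return(&q->count) >= DEFER_BATCH_LOW)
			queue_work(system_unbound_wq, &q->work);
		throttle = atomic_read(&q->count) >= DEFER_BATCH_MAX;
		put_cpu_ptr(&defer_queues);

		/* At the hard cap, make the producer wait for the background worker. */
		if (throttle)
			flush_work(&q->work);
	}

	/* One-off init: hook each per-CPU queue up to the worker function. */
	static void defer_queues_init(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			INIT_WORK(&per_cpu(defer_queues, cpu).work, defer_worker);
	}

The cache-residency argument in the thread is about DEFER_BATCH_LOW:
the worker processes a batch small enough that the queued inodes are
still in the CPU's data cache, while the hard cap bounds how far the
producer can run ahead of inactivation.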
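
Likewise, a sketch of how the polled grace-period API Paul mentions
could hang off such a queue: snapshot a cookie when an item is
deferred, and only pay for a grace-period wait if one has not already
elapsed by the time the batch is processed. The existing non-expedited
calls are shown; the expedited counterparts Paul is offering would be
drop-in replacements. defer_item_rcu() and process_one() are invented
names, and defer_item() refers to the sketch above:

	#include <linux/rcupdate.h>
	#include <linux/llist.h>

	struct deferred_item {
		struct llist_node	node;
		unsigned long		rcu_cookie;	/* grace-period state at queue time */
	};

	/* Producer: snapshot (and kick off) a grace period when deferring. */
	static void defer_item_rcu(struct deferred_item *item)
	{
		item->rcu_cookie = start_poll_synchronize_rcu();
		defer_item(&item->node);	/* per-CPU batching queue from the sketch above */
	}

	/* Worker: by the time a batch is processed, the grace period has usually elapsed. */
	static void process_one(struct deferred_item *item)
	{
		if (!poll_state_synchronize_rcu(item->rcu_cookie))
			cond_synchronize_rcu(item->rcu_cookie);	/* rare slow path: wait it out */

		/* ... now safe, RCU-wise, to inactivate and recycle the object ... */
	}

start_poll_synchronize_rcu() is used rather than get_state_synchronize_rcu()
so that a grace period is guaranteed to already be in flight while the
item sits on the queue, making the poll on the worker side more likely
to succeed without blocking.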