On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote:
> On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote:
> > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote:
> > > > FYI, I modified my repeated alloc/free test to do some batching
> > > > and form it into something more able to measure the potential
> > > > side effect / cost of the grace period sync. The test is a
> > > > single threaded, file alloc/free loop using a variable per
> > > > iteration batch size. The test runs for ~60s and reports how
> > > > many total files were allocated/freed in that period with the
> > > > specified batch size. Note that this particular test ran
> > > > without any background workload. Results are as follows:
> > > >
> > > >   files         baseline        test
> > > >
> > > >   1             38480           38437
> > > >   4             126055          111080
> > > >   8             218299          134469
> > > >   16            306619          141968
> > > >   32            397909          152267
> > > >   64            418603          200875
> > > >   128           469077          289365
> > > >   256           684117          566016
> > > >   512           931328          878933
> > > >   1024          1126741         1118891
> > >
> > > Can you post the test code, because 38,000 alloc/unlinks in 60s is
> > > extremely slow for a single tight open-unlink-close loop. I'd be
> > > expecting at least ~10,000 alloc/unlink iterations per second, not
> > > 650/second.
> > >
> >
> > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm
> > <files>' loop. I know there was application overhead because if I
> > tweaked the script to open an fd directly rather than use touch, the
> > single file performance jumped up a bit, but it seemed to wash away
> > as I increased the file count so I kept running it with larger
> > sizes. This seems off so I'll port it over to C code and see how
> > much the numbers change.
>
> Yeah, using touch/rm becomes fork/exec bound very quickly. You'll
> find that using "echo > <file>" is much faster than "touch <file>"
> because it runs a shell built-in operation without fork/exec
> overhead to create the file. But you can't play tricks like that to
> replace rm:
>
> $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done
>
> real	0m2.653s
> user	0m0.910s
> sys	0m2.051s
> $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done
>
> real	0m1.260s
> user	0m0.452s
> sys	0m0.913s
> $ time ./open-unlink 1000 /mnt/scratch/foo
>
> real	0m0.037s
> user	0m0.001s
> sys	0m0.030s
> $
>
> Note the difference in system time between the three operations -
> almost all the difference in system CPU time is the overhead of
> fork/exec to run the touch/rm binaries, not do the filesystem
> operations....
>
> > > > That's just a test of a quick hack, however. Since there is no
> > > > real urgency to inactivate an unlinked inode (it has no
> > > > potential users until it's freed),
> > >
> > > On the contrary, there is extreme urgency to inactivate inodes
> > > quickly.
> > >
> >
> > Ok, I think we're talking about slightly different things. What I
> > mean above is that if a task removes a file and goes off doing
> > unrelated $work, that inode will just sit on the percpu queue
> > indefinitely. That's fine, as there's no functional need for us to
> > process it immediately unless we're around -ENOSPC thresholds or
> > some such that demand reclaim of the inode.
>
> Yup, an occasional unlink sitting around for a while on an unlinked
> list isn't going to cause a performance problem.
> Indeed, such workloads are more likely to benefit from the reduced
> unlink() syscall overhead and won't even notice the increase in
> background CPU overhead for inactivation of those occasional inodes.
>
> > It sounds like what you're talking about is specifically the
> > behavior/performance of sustained file removal (which is important
> > obviously), where apparently there is a notable degradation if the
> > queues become deep enough to push the inode batches out of CPU
> > cache. So that makes sense...
>
> Yup, sustained bulk throughput is where cache residency really
> matters. And for unlink, sustained unlink workloads are quite
> common; they often are something people wait for on the command
> line or make up a performance critical component of a highly
> concurrent workload, so it's pretty important to get this part
> right.
>
> > > Darrick made the original assumption that we could delay
> > > inactivation indefinitely and so he allowed really deep queues of
> > > up to 64k deferred inactivations. But with queues this deep, we
> > > could never get that background inactivation code to perform
> > > anywhere near the original synchronous background inactivation
> > > code. e.g. I measured 60-70% performance degradations on my
> > > scalability tests, and nothing stood out in the profiles until I
> > > started looking at CPU data cache misses.
> > >
> >
> > ... but could you elaborate on the scalability tests involved here
> > so I can get a better sense of it in practice and perhaps observe
> > the impact of changes in this path?
>
> The same concurrent fsmark create/traverse/unlink workloads I've
> been running for the past decade+ demonstrate it pretty simply. I
> also saw regressions with dbench (both op latency and throughput) as
> the client count (concurrency) increased, and with compilebench. I
> didn't look much further because all the common benchmarks I ran
> showed perf degradations with arbitrary delays that went away with
> the current code we have. ISTR that parts of the aim7/reaim
> scalability workloads that the intel zero-day infrastructure runs
> are quite sensitive to background inactivation delays as well,
> because that's a CPU bound workload and hence any reduction in cache
> residency results in a reduction of the number of concurrent jobs
> that can be run.

Curiosity and all that, but has this work produced any intuition on
the sensitivity of the performance/scalability to the delays?  As in
the effect of microseconds vs. tens of microseconds vs. hundreds of
microseconds?

							Thanx, Paul
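
[Editor's note: for readers who want to reproduce the kind of tight,
batched alloc/free loop discussed above, here is a minimal C sketch.
The program name, argument layout, and batching knob are illustrative
assumptions; this is not the actual ./open-unlink binary nor Brian's
test harness, just an open-unlink-close loop with optional batching.]

/*
 * batch-alloc-free: create <batch> files, then unlink them, and repeat
 * until <iterations> files have been created and removed in total.
 * Hypothetical sketch - names and knobs are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[4096];
	const char *dir;
	long iters, batch, i, j;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <iterations> <batch> <dir>\n",
			argv[0]);
		return 1;
	}
	iters = atol(argv[1]);
	batch = atol(argv[2]);
	dir = argv[3];
	if (iters < 1 || batch < 1)
		return 1;

	for (i = 0; i < iters; i += batch) {
		/* create a batch of empty files */
		for (j = 0; j < batch; j++) {
			int fd;

			snprintf(path, sizeof(path), "%s/f%ld", dir, j);
			fd = open(path, O_CREAT | O_WRONLY, 0644);
			if (fd < 0) {
				perror("open");
				return 1;
			}
			close(fd);
		}
		/* then unlink the whole batch */
		for (j = 0; j < batch; j++) {
			snprintf(path, sizeof(path), "%s/f%ld", dir, j);
			if (unlink(path)) {
				perror("unlink");
				return 1;
			}
		}
	}
	return 0;
}

Run under time(1) with different batch sizes, e.g.
"time ./batch-alloc-free 100000 32 /mnt/scratch", to compare the
per-batch cost without any fork/exec overhead in the loop.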