Re: [PATCH] xfs: require an rcu grace period before inode recycle

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 25 Jan 2022 13:30:36 -0500

On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote:
> On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote:
> > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote:
> > > > FYI, I modified my repeated alloc/free test to do some batching and form
> > > > it into something more able to measure the potential side effect / cost
> > > > of the grace period sync. The test is a single threaded, file alloc/free
> > > > loop using a variable per iteration batch size. The test runs for ~60s
> > > > and reports how many total files were allocated/freed in that period
> > > > with the specified batch size. Note that this particular test ran
> > > > without any background workload. Results are as follows:
> > > > 
> > > > 	files		baseline	test
> > > > 
> > > > 	1		38480		38437
> > > > 	4		126055		111080
> > > > 	8		218299		134469
> > > > 	16		306619		141968
> > > > 	32		397909		152267
> > > > 	64		418603		200875
> > > > 	128		469077		289365
> > > > 	256		684117		566016
> > > > 	512		931328		878933
> > > > 	1024		1126741		1118891
> > > 
> > > Can you post the test code, because 38,000 alloc/unlinks in 60s is
> > > extremely slow for a single tight open-unlink-close loop. I'd be
> > > expecting at least ~10,000 alloc/unlink iterations per second, not
> > > 650/second.
> > > 
> > 
> > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm
> > <files>' loop. I know there was application overhead because if I
> > tweaked the script to open an fd directly rather than use touch, the
> > single file performance jumped up a bit, but it seemed to wash away as I
> > increased the file count so I kept running it with larger sizes. This
> > seems off so I'll port it over to C code and see how much the numbers
> > change.
> 
> Yeah, using touch/rm becomes fork/exec bound very quickly. You'll
> find that using "echo > <file>" is much faster than "touch <file>"
> because it runs a shell built-in operation without fork/exec
> overhead to create the file. But you can't play tricks like that to
> replace rm:
> 

I had used 'exec' to open an fd (same idea) in the single file case and
tested with that, saw that the increase was consistent and took that
along with the increasing performance as batch sizes increased to mean
that the application overhead wasn't a factor as the test scaled. That
was clearly wrong, because if I port the whole thing to a C program the
baseline numbers are way off. I think what also threw me off is that the
single file test kernel case is actually fairly accurate between the two
tests. Anyways, here's a series of (single run, no averaging, etc.) test
runs with the updated test. Note that I reduced the runtime to 10s here
since the test was running so much faster. Otherwise this is the same
batched open/close -> unlink behavior:

                baseline        test
batch:  1       files:  893579  files:  41841
batch:  2       files:  912502  files:  41922
batch:  4       files:  930424  files:  42084
batch:  8       files:  932072  files:  41536
batch:  16      files:  930624  files:  41616
batch:  32      files:  777088  files:  41120
batch:  64      files:  567936  files:  57216
batch:  128     files:  579840  files:  96256
batch:  256     files:  548608  files:  174080
batch:  512     files:  546816  files:  246784
batch:  1024    files:  509952  files:  328704
batch:  2048    files:  505856  files:  399360
batch:  4096    files:  479232  files:  438272

So this shows that the performance delta is actually massive from the
start. For reference, a single threaded, empty file, non syncing,
fs_mark workload stabilizes at around ~55k files/sec on this fs. Both
kernels sort of converge to that rate as the batch size increases, only
the baseline kernel starts much faster and normalizes while the test
kernel starts much slower and improves (and still really doesn't hit the
mark even at a 4k batch size).

My takeaway from this is that we may need to find a way to mitigate this
overhead somewhat better than what the current patch does. Otherwise,
this is a significant dropoff from even a pure allocation workload in
simple mixed workload scenarios...

> $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done
> 
> real    0m2.653s
> user    0m0.910s
> sys     0m2.051s
> $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done
> 
> real    0m1.260s
> user    0m0.452s
> sys     0m0.913s
> $ time ./open-unlink 1000 /mnt/scratch/foo
> 
> real    0m0.037s
> user    0m0.001s
> sys     0m0.030s
> $
> 
> Note the difference in system time between the three operations -
> almost all the difference in system CPU time is the overhead of
> fork/exec to run the touch/rm binaries, not do the filesystem
> operations....
> 
> > > > That's just a test of a quick hack, however. Since there is no real
> > > > urgency to inactivate an unlinked inode (it has no potential users until
> > > > it's freed),
> > > 
> > > On the contrary, there is extreme urgency to inactivate inodes
> > > quickly.
> > > 
> > 
> > Ok, I think we're talking about slightly different things. What I mean
> > above is that if a task removes a file and goes off doing unrelated
> > $work, that inode will just sit on the percpu queue indefinitely. That's
> > fine, as there's no functional need for us to process it immediately
> > unless we're around -ENOSPC thresholds or some such that demand reclaim
> > of the inode.
> 
> Yup, an occasional unlink sitting around for a while on an unlinked
> list isn't going to cause a performance problem.  Indeed, such
> workloads are more likely to benefit from the reduced unlink()
> syscall overhead and won't even notice the increase in background
> CPU overhead for inactivation of those occasional inodes.
> 
> > It sounds like what you're talking about is specifically
> > the behavior/performance of sustained file removal (which is important
> > obviously), where apparently there is a notable degradation if the
> > queues become deep enough to push the inode batches out of CPU cache. So
> > that makes sense...
> 
> Yup, sustained bulk throughput is where cache residency really
> matters. And for unlink, sustained unlink workloads are quite
> common; they often are something people wait for on the command line
> or make up a performance critical component of a highly concurrent
> workload so it's pretty important to get this part right.
> 
> > > Darrick made the original assumption that we could delay
> > > inactivation indefinitely and so he allowed really deep queues of up
> > > to 64k deferred inactivations. But with queues this deep, we could
> > > never get that background inactivation code to perform anywhere near
> > > the original synchronous background inactivation code. e.g. I
> > > measured 60-70% performance degradataions on my scalability tests,
> > > and nothing stood out in the profiles until I started looking at
> > > CPU data cache misses.
> > > 
> > 
> > ... but could you elaborate on the scalability tests involved here so I
> > can get a better sense of it in practice and perhaps observe the impact
> > of changes in this path?
> 
> The same conconrrent fsmark create/traverse/unlink workloads I've
> been running for the past decade+ demonstrates it pretty simply. I
> also saw regressions with dbench (both op latency and throughput) as
> the clinet count (concurrency) increased, and with compilebench.  I
> didn't look much further because all the common benchmarks I ran
> showed perf degradations with arbitrary delays that went away with
> the current code we have.  ISTR that parts of aim7/reaim scalability
> workloads that the intel zero-day infrastructure runs are quite
> sensitive to background inactivation delays as well because that's a
> CPU bound workload and hence any reduction in cache residency
> results in a reduction of the number of concurrent jobs that can be
> run.
> 

Ok, so if I (single threaded) create (via fs_mark), sync and remove 5m
empty files, the remove takes about a minute. If I just bump out the
current queue and block thresholds by 10x and repeat, that time
increases to about ~1m24s. If I hack up a kernel to disable queueing
entirely (i.e. fully synchronous inactivation), then I'm back to about a
minute again. So I'm not producing any performance benefit with
queueing/batching in this single threaded scenario, but I suspect the
10x threshold delta is at least measuring the negative effect of poor
caching..? (Any decent way to confirm that..?).

And of course if I take the baseline kernel and stick a
cond_synchronize_rcu() in xfs_inactive_ifree() it brings the batch test
numbers right back but slows the removal test way down. What I find
interesting however is that if I hack up something more mild like invoke
cond_synchronize_rcu() on the oldest inode in the current inactivation
batch, bump out the blocking threshold as above (but leave the queueing
threshold at 32), and leave the iget side cond_sync_rcu() to catch
whatever falls through, my 5m file remove test now completes ~5-10s
faster than baseline and I see the following results from the batched
alloc/free test:

batch:  1       files:  731923
batch:  2       files:  693020
batch:  4       files:  750948
batch:  8       files:  743296
batch:  16      files:  738720
batch:  32      files:  746240
batch:  64      files:  598464
batch:  128     files:  672896
batch:  256     files:  633856
batch:  512     files:  605184
batch:  1024    files:  569344
batch:  2048    files:  555008
batch:  4096    files:  524288

Hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>