On Mon, Dec 07, 2020 at 12:18:13PM +0200, Alex Lyakas wrote:
> Hi Dave,
>
> Thank you for your response.
>
> We did some more investigations on the issue, and we have the following
> findings:
>
> 1) We tracked the max amount of inodes per AG radix tree. We found in our
> tests, that the max amount of inodes per AG radix tree was about 1.5M:
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384662 reclaimable=58
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384630 reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384600 reclaimable=16
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594500 reclaimable=75
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594468 reclaimable=55
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594436 reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594421 reclaimable=42
> (but the amount of reclaimable inodes is very small, as you can see).
>
> Do you think this number is reasonable per radix tree?

That's fine. I regularly run tests that push 10M+ inodes into a single
radix tree, and that generally doesn't even show up on the profiles....

> 2) This particular XFS instance is total of 500TB. However, the AG size in
> this case is 100GB.

Ok. I run scalability tests on 500TB filesystems with 500 AGs that
hammer inode reclaim, but it should be trivial to run them with 5000
AGs. Hell, let's just run 50,000 AGs to see if there's actually an
iteration problem in the shrinker... (we really need async IO in mkfs
for doing things like this!)

Yup, that hurts a bit on 5.10-rc7, but not significantly. Profile with
16 CPUs turning over 250,000 inodes/s through the cache:

   - 3.17% xfs_fs_free_cached_objects
      - xfs_reclaim_inodes_nr
         - 3.09% xfs_reclaim_inodes_ag
            - 0.91% _raw_spin_lock
                 0.87% do_raw_spin_lock
            - 0.71% _raw_spin_unlock
               - 0.67% do_raw_spin_unlock
                    __raw_callee_save___pv_queued_spin_unlock

Which indicates that spinlocks are the largest CPU user in that path.
That's likely the radix tree spin locks when removing inodes from the
AG, because the upstream code now allows multiple reclaimers to operate
on the same AG. But even with that, there isn't any sign of holdoff
latencies, scanning delays, etc. occurring inside the RCU critical
section.

IOWs, bumping up the number of AGs massively shouldn't impact the RCU
code here, as the RCU critical region is inside the loop over the AGs,
not spanning the loop.

I don't know how old your kernel is, but maybe something is getting
stuck on a spinlock (per-inode or per-AG) inside the RCU section? I.e.
maybe you see an RCU stall because the code has livelocked or has
severe contention on a per-AG or per-inode spinlock inside the RCU
section?

I suspect you are going to need to profile the code when it is running
to get some idea of what it is actually doing when the stalls occur...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
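
P.S. As a reference for where the RCU boundaries sit in what I'm
describing above, here is a minimal sketch of the reclaim walk shape,
assuming the upstream xfs_reclaim_inodes_ag() structure around 5.10.
It is condensed and renamed, with the reclaim cursor, error handling
and per-AG exclusion elided, so treat the details as approximate:

	/*
	 * Condensed sketch of the per-AG inode reclaim walk.  The key
	 * point: rcu_read_lock()/rcu_read_unlock() bracket only a single
	 * batched radix tree lookup within one AG -- the RCU section
	 * never spans the loop over all the AGs.
	 */
	static void
	reclaim_inodes_ag_sketch(struct xfs_mount *mp, int *nr_to_scan)
	{
		struct xfs_perag	*pag;
		xfs_agnumber_t		agno = 0;

		/* Only visit AGs that have inodes tagged for reclaim. */
		while ((pag = xfs_perag_get_tag(mp, agno, XFS_ICI_RECLAIM_TAG))) {
			unsigned long	first_index = 0;
			int		nr_found;

			agno = pag->pag_agno + 1;

			do {
				struct xfs_inode	*batch[XFS_LOOKUP_BATCH];
				int			i;

				/* RCU covers one batched lookup in this AG only. */
				rcu_read_lock();
				nr_found = radix_tree_gang_lookup_tag(
						&pag->pag_ici_root,
						(void **)batch, first_index,
						XFS_LOOKUP_BATCH,
						XFS_ICI_RECLAIM_TAG);
				if (!nr_found) {
					rcu_read_unlock();
					break;
				}

				/*
				 * Validate each inode while still under RCU
				 * and advance the lookup cursor past the
				 * batch.
				 */
				for (i = 0; i < nr_found; i++) {
					struct xfs_inode *ip = batch[i];

					if (!xfs_reclaim_inode_grab(ip))
						batch[i] = NULL;
					first_index = XFS_INO_TO_AGINO(mp,
							ip->i_ino) + 1;
				}
				rcu_read_unlock();

				/* Do the blocking reclaim work outside RCU. */
				for (i = 0; i < nr_found; i++) {
					if (batch[i])
						xfs_reclaim_inode(batch[i], pag);
				}

				*nr_to_scan -= nr_found;
			} while (nr_found && *nr_to_scan > 0);

			xfs_perag_put(pag);
		}
	}

If the AG count alone were the problem, the extra cost would show up
between those short RCU sections (more perag lookups and tree walks),
not inside one of them -- which is why a per-inode or per-AG spinlock
getting stuck inside the section is the more likely suspect for the
stalls you are seeing.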