On Mon, Dec 07, 2020 at 12:18:13PM +0200, Alex Lyakas wrote:
> Hi Dave,
>
> Thank you for your response.
>
> We did some more investigations on the issue, and we have the following
> findings:
>
> 1) We tracked the max amount of inodes per AG radix tree. We found in our
> tests, that the max amount of inodes per AG radix tree was about 1.5M:
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384662 reclaimable=58
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384630 reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384600 reclaimable=16
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594500 reclaimable=75
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594468 reclaimable=55
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594436 reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594421 reclaimable=42
> (but the amount of reclaimable inodes is very small, as you can see).
>
> Do you think this number is reasonable per radix tree?

That's fine. I regularly run tests that push 10M+ inodes into a single
radix tree, and that generally doesn't even show up on the profiles....

> 2) This particular XFS instance is total of 500TB. However, the AG size in
> this case is 100GB.

Ok. I run scalability tests on 500TB filesystems with 500 AGs that
hammer inode reclaim, but it should be trivial to run them with 5000
AGs. Hell, let's just run 50,000 AGs to see if there's actually an
iteration problem in the shrinker... (we really need async IO in mkfs
for doing things like this!)

Yup, that hurts a bit on 5.10-rc7, but not significantly. Profile with
16 CPUs turning over 250,000 inodes/s through the cache:

   - 3.17% xfs_fs_free_cached_objects
      - xfs_reclaim_inodes_nr
         - 3.09% xfs_reclaim_inodes_ag
            - 0.91% _raw_spin_lock
                 0.87% do_raw_spin_lock
            - 0.71% _raw_spin_unlock
               - 0.67% do_raw_spin_unlock
                    __raw_callee_save___pv_queued_spin_unlock

Which indicates that spinlocks are the largest CPU user in that path.
That's likely the radix tree spin locks when removing inodes from the
AG, because the upstream code now allows multiple reclaimers to operate
on the same AG. But even with that, there isn't any sign of holdoff
latencies, scanning delays, etc. occurring inside the RCU critical
section.

IOWs, bumping up the number of AGs massively shouldn't impact the RCU
code here, as the RCU critical region is inside the loop over the AGs,
not spanning the loop.

I don't know how old your kernel is, but maybe something is getting
stuck on a spinlock (per-inode or per-AG) inside the RCU section? I.e.
maybe you see an RCU stall because the code has livelocked or has
severe contention on a per-AG or per-inode spinlock inside the RCU
section?

I suspect you are going to need to profile the code when it is running
to get some idea of what it is actually doing when the stalls occur...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
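
P.S. As a reference for where the RCU boundaries sit in what I'm
describing above, here is a minimal sketch of the reclaim walk shape,
assuming the upstream xfs_reclaim_inodes_ag() structure around 5.10.
It is condensed and renamed, with the reclaim cursor, error handling
and per-AG exclusion elided, so treat the details as approximate:

	/*
	 * Condensed sketch of the per-AG inode reclaim walk.  The key
	 * point: rcu_read_lock()/rcu_read_unlock() bracket only a single
	 * batched radix tree lookup within one AG -- the RCU section
	 * never spans the loop over all the AGs.
	 */
	static void
	reclaim_inodes_ag_sketch(struct xfs_mount *mp, int *nr_to_scan)
	{
		struct xfs_perag	*pag;
		xfs_agnumber_t		agno = 0;

		/* Only visit AGs that have inodes tagged for reclaim. */
		while ((pag = xfs_perag_get_tag(mp, agno, XFS_ICI_RECLAIM_TAG))) {
			unsigned long	first_index = 0;
			int		nr_found;

			agno = pag->pag_agno + 1;

			do {
				struct xfs_inode	*batch[XFS_LOOKUP_BATCH];
				int			i;

				/* RCU covers one batched lookup in this AG only. */
				rcu_read_lock();
				nr_found = radix_tree_gang_lookup_tag(
						&pag->pag_ici_root,
						(void **)batch, first_index,
						XFS_LOOKUP_BATCH,
						XFS_ICI_RECLAIM_TAG);
				if (!nr_found) {
					rcu_read_unlock();
					break;
				}

				/*
				 * Validate each inode while still under RCU
				 * and advance the lookup cursor past the
				 * batch.
				 */
				for (i = 0; i < nr_found; i++) {
					struct xfs_inode *ip = batch[i];

					if (!xfs_reclaim_inode_grab(ip))
						batch[i] = NULL;
					first_index = XFS_INO_TO_AGINO(mp,
							ip->i_ino) + 1;
				}
				rcu_read_unlock();

				/* Do the blocking reclaim work outside RCU. */
				for (i = 0; i < nr_found; i++) {
					if (batch[i])
						xfs_reclaim_inode(batch[i], pag);
				}

				*nr_to_scan -= nr_found;
			} while (nr_found && *nr_to_scan > 0);

			xfs_perag_put(pag);
		}
	}

If the AG count alone were the problem, the extra cost would show up
between those short RCU sections (more perag lookups and tree walks),
not inside one of them -- which is why a per-inode or per-AG spinlock
getting stuck inside the section is the more likely suspect for the
stalls you are seeing.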