Re: xfs_reclaim_inodes_ag taking several seconds

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 01.08.20 00:32, Dave Chinner wrote:
On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
Dear Linux people,

we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices (16 * 7.3 T disks) each, Current Linux version is 5.4.54.
.....
root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
MemTotal:       263572332 kB

256GB of RAM.

MemFree:         2872368 kB
MemAvailable:   204193824 kB

200GB "available"

Buffers:            2568 kB
Cached:         164931356 kB

160GB in page cache

KReclaimable:   40079660 kB
Slab:           49988268 kB
SReclaimable:   40079660 kB

40GB in reclaimable slab objects.

IOWs, you have no free memory in the machine and so allocation
will frequently be dipping into memory reclaim to free up page cache
and slab caches to make memory available.

xfs_inode         30978282 31196832    960    4    1 : tunables   54   27    8 : slabdata 7799208 7799208    434

Yes, 30 million cached inodes.

bio_integrity_payload 29644966 30203481    192   21    1 : tunables  120   60    8 : slabdata 1438261 1438261    480

Either there is a memory leak in this slab, or it is shared with
something like the xfs_ili slab, which would indicate that most
of the cached inodes have been dirtied in memory at some point in
time.

I think you are right here:

    crash> p $s->name
    $84 = 0xffffffff82259401 "bio_integrity_payload"
    crash> p $s->refcount
    $88 = 8
    crash> p $s
    $92 = (struct kmem_cache *) 0xffff88bff92d2bc0
    crash> p sizeof(xfs_inode_log_item_t)
    $93 = 192
    crash> p $s->object_size
    $94 = 192

So if I understand you correctly, this is expected behavior with this kind of load and conceptual changes are already scheduled for kernel 5.9. I don't understand most of it, but isn't it true that with that planned changes the impact might be better limited to the filesystem, so that the performance of other areas of the system might improve? I'd love to test that with our load, but I don't want to risk our backup data and it would be difficult to produce the same load on a toy system. The patch set is not yet ready to be tested on production data, is it?

So I guess I'll try to put the backup processes into one or more cgroups to limit the memory available for their fs caches and leave some room for unrelated (maintenance) processes. I hope, that makes sense.

Thank you both four your analysis!

Donald

And if you have 30 million inodes in memory, and lots of them are
dirty, and the shrinkers are running, then they will be doing
dirty inode writeback to throttle memory reclaim to
ensure it makes progress and doesn't declare OOM and kill processes
permaturely.

You have spinning disks, RAID6. I'm betting that it can only clean a
couple of hundred inodes a second because RAID6 is severely IOP
limited for small writes (like inode clusters). And when you many,
many thousands (maybe millions) of dirty inodes, anything that has
to wait on inode writeback is going to be waiting for some time...

root:done:/home/buczek/linux_problems/shrinker_semaphore/# xfs_info /amd/done/C/C8024
meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=0
data     =                       bsize=4096   blocks=27348629504, imaxpct=1
          =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2

And full size journals, so the filesystem can hold an awful lot of
active dirty inodes in memory before it starts throttling on a full
journal (think millions of dirty inodes per filesystem)...

So, yeah, this is the classic "in memory operation is orders of
magnitude faster than disk operation" and it all comes crashing down
when something needs to wait for inodes to be written back. The
patchset Darrick pointed you at should fix the shrinker issue, but
it's likely that this will just push the problem to the next
throttling point, which is the journal filling up.

IOWs, I suspect fixing your shrinker problem is only going to make
the overload of dirty inodes in the system behave worse, because
running out of journal space cause *all modifications* to the
filesystem to start taking significant delays while they wait for
inode writeback to free journal space, not just have things
trying to register/unregister shrinkers take delays...

Cheers,

Dave.


--
Donald Buczek
buczek@xxxxxxxxxxxxx
Tel: +49 30 8413 1433



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux