Hi Neil!

I got this email address from the Contact page of your neil.brown.name website and hope that you can point me in the right direction. The issue is this: I'm looking into a performance issue at a university which seems to be caused by the raid6-md0 kernel task consuming more and more memory during consistency checks, memory that is never released. When the weekly /etc/cron.d/raid-check is executed, the university noticed that the system loses "about 5 GB" of free memory each week. The performance of the system also deteriorates slightly after the raid-check completes. Jobs scheduled on these servers can take several thousand minutes to complete, which is how they notice the performance degradation.

The server in question is running CentOS 7.5 with kernel 3.10.0-862.3.2 on PPC64. The RAID array is 114.6 TB and consists of 24x 5.5 TB drives (21 data + 2 parity, plus 1x 5.5 TB spare). All drives are directly connected via a SAS enclosure. The system has 256 GB of RAM.

I don't think these specifics are directly pertinent, as I've been able to replicate the increasing memory footprint of the raid6-md0 task on an x86_64 VM with 1 GB of RAM and a RAID6 consisting of 4x 50 GB VHDs. The reason I mention them is that there appears to be a linear relationship between the memory usage on my test VM and on the actual bare-metal server, which I'll include below.

Immediately after a reboot, I see the following on my VM:

================================================
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0  272           272        1864      17           8              : tunables 0       0            0              : slabdata 16             16          0
================================================

The array is empty (no filesystems, partitions, or anything), so the disks are idle. If I trigger a raid-check manually and then re-examine the slabinfo:

================================================
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0  3060          3060       1864      17           8              : tunables 0       0            0              : slabdata 180            180         0
================================================

Executing the raid-check a second time, the memory usage increases again:

================================================
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0  4420          4420       1864      17           8              : tunables 0       0            0              : slabdata 260            260         0
================================================

So, this accounts for the loss of available memory. Without knowing what's going on inside the module, my suspicion is that the dip in performance comes from the kernel module maintaining a list or tree that must be traversed when disk IO is requested, and that it is this list/tree that is growing and not being pruned correctly after the consistency check. This is a gut feeling, not based on anything I've seen in the source. For small arrays, the additional memory consumed each week is minimal and the performance hit is very minimal.
For relatively large arrays, like the university's, the memory consumption and the performance hit on long-duration jobs become more readily apparent.

I also tried turning on slab tracing for the raid6-md0 cache and see more alloc calls than free calls during the consistency check:

================================================
# echo 1 > /sys/kernel/slab/raid6-md0/trace
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0  4060          4068       1728      18           8              : tunables 0       0            0              : slabdata 226            226         0
# grep 'TRACE raid6-md0 alloc' messages | wc -l
1520
# grep 'TRACE raid6-md0 free' messages | wc -l
0
================================================

This shows many allocs and no frees during the first check. I performed another check and saw the active objects climb to 5185, with 1015 more allocs and still no frees.

================================================
Dec 4 16:44:26 localhost kernel: TRACE raid6-md0 alloc 0xffff8e6ed57b7480 inuse=17 fp=0x (null)
Dec 4 16:44:26 localhost kernel: CPU: 3 PID: 443 Comm: md0_raid6 Not tainted 3.10.0-862.3.3.el7.x86_64.debug #1
Dec 4 16:44:26 localhost kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Dec 4 16:44:26 localhost kernel: Call Trace:
Dec 4 16:44:26 localhost kernel: [<ffffffff9dbe9181>] dump_stack+0x19/0x1b
Dec 4 16:44:26 localhost kernel: [<ffffffff9dbe5d5b>] alloc_debug_processing+0xc5/0x118
Dec 4 16:44:26 localhost kernel: [<ffffffff9d64380a>] ___slab_alloc+0x53a/0x560
Dec 4 16:44:26 localhost kernel: [<ffffffffc07c249e>] ? alloc_stripe+0x2e/0x190 [raid456]
Dec 4 16:44:26 localhost kernel: [<ffffffff9dbf38d6>] ? _raw_spin_unlock_irqrestore+0x36/0x70
Dec 4 16:44:26 localhost kernel: [<ffffffff9d4745b3>] ? kvm_clock_read+0x33/0x40
Dec 4 16:44:26 localhost kernel: [<ffffffffc07c249e>] ? alloc_stripe+0x2e/0x190 [raid456]
Dec 4 16:44:26 localhost kernel: [<ffffffff9dbe6051>] __slab_alloc+0x46/0x7d
Dec 4 16:44:26 localhost kernel: [<ffffffffc07c249e>] ? alloc_stripe+0x2e/0x190 [raid456]
Dec 4 16:44:26 localhost kernel: [<ffffffff9d643b47>] kmem_cache_alloc+0x317/0x3e0
Dec 4 16:44:26 localhost kernel: [<ffffffffc07c249e>] alloc_stripe+0x2e/0x190 [raid456]
Dec 4 16:44:26 localhost kernel: [<ffffffffc07c766d>] grow_one_stripe+0x2d/0xf0 [raid456]
Dec 4 16:44:26 localhost kernel: [<ffffffffc07d34b6>] raid5d+0x7e6/0x880 [raid456]
Dec 4 16:44:26 localhost kernel: [<ffffffff9d52f73d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
Dec 4 16:44:26 localhost kernel: [<ffffffff9d9d5f6b>] md_thread+0x15b/0x1a0
Dec 4 16:44:26 localhost kernel: [<ffffffff9d4d7880>] ? wake_up_atomic_t+0x30/0x30
Dec 4 16:44:26 localhost kernel: [<ffffffff9d9d5e10>] ? find_pers+0x80/0x80
Dec 4 16:44:26 localhost kernel: [<ffffffff9d4d64cf>] kthread+0xef/0x100
Dec 4 16:44:26 localhost kernel: [<ffffffff9d4d63e0>] ? insert_kthread_work+0x80/0x80
Dec 4 16:44:26 localhost kernel: [<ffffffff9dbff1f7>] ret_from_fork_nospec_begin+0x21/0x21
Dec 4 16:44:26 localhost kernel: [<ffffffff9d4d63e0>] ? insert_kthread_work+0x80/0x80
================================================

Looking at the backtrace, it appears to be the grow_one_stripe() call within raid5d() that is allocating all of the RAM, while raid5_cache_scan()'s calls to drop_one_stripe() are where it would be deallocated.
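In case it's useful, here is how I've been poking at the stripe cache from userspace on my test VM. This is only a rough sketch and assumes I'm reading the md sysfs interface correctly (the paths below are for my md0), and that "echo 2 > /proc/sys/vm/drop_caches" really does ask the kernel to run the registered shrinkers, which I believe should include the raid5 one:

================================================
# Configured upper bound on cached stripes, and the number currently in use
cat /sys/block/md0/md/stripe_cache_size
cat /sys/block/md0/md/stripe_cache_active

# Force slab reclaim; if the raid5 shrinker (raid5_cache_scan() ->
# drop_one_stripe()) is doing its job, the raid6-md0 slab should
# shrink back down afterwards
sync
echo 2 > /proc/sys/vm/drop_caches
egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
================================================

My hope is that this distinguishes stripes that are still reclaimable under memory pressure from stripes that are genuinely leaked.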
Looking at the objsize and active_objs, I can see that the memory footprint of the raid6-md0 cache increases from 507 KB (1864 * 272) after boot to 5.7 MB (1864 * 3060) after the first check, growing to 9.2 MB (1864 * 4931) after the 3rd check. This agrees with what slabtop reports. If we linearly scale this up 1146x (from 100 GB to 114.6 TB), the first raid-check would consume about 6.5 GB of additional RAM, which is close to the "about 5 GB" of free RAM reported to be lost each week.

To see if perhaps this issue has been addressed in a newer kernel, I installed kernel 4.19.5-1.el7.elrepo.x86_64 and re-ran my tests immediately after a reboot. I saw the same 272 active objects to start, but the active objects increased much more with this kernel, leading me to believe that the issue has not been resolved, and may actually be exacerbated, in newer kernels:

================================================
# uname -r
4.19.5-1.el7.elrepo.x86_64
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0  272           272        1856      17           8              : tunables 0       0            0              : slabdata 16             16          0
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0  5117          5117       1856      17           8              : tunables 0       0            0              : slabdata 301            301         0
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0  7089          7089       1856      17           8              : tunables 0       0            0              : slabdata 417            417         0
================================================

As you can see, after the 3rd consistency check with the newer kernel, the memory consumption is about 13.1 MB (1856 * 7089).

Do you have any suggestions on how I can troubleshoot this further?

Thanks!
-Rich
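P.S. In case it helps to reproduce the arithmetic above, here is a quick one-liner (assuming the slabinfo column order shown earlier, i.e. active_objs in field 2 and objsize in field 4) that turns the raid6-md0 line into an approximate footprint in MB:

================================================
# active_objs ($2) * objsize ($4), reported in MB
awk '/^raid6-md0 / { printf "raid6-md0 slab footprint: %.1f MB\n", $2 * $4 / 1e6 }' /proc/slabinfo
================================================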