Software RAID memory issue?


 



Hi Neil!
 
I got this email address from the Contact page of your neil.brown.name website and hope that you can point me in the right direction.
 
The issue is this:  I’m looking into a performance issue at a university which seems to be caused by the md0_raid6 kernel task consuming more and more memory during consistency checks, memory that is never released.
 
When the weekly /etc/cron.d/raid-check job runs, the university noticed that the system loses “about 5 GB” of free memory.  System performance also deteriorates slightly after each raid-check completes.
 
Jobs scheduled on these servers can take several thousand minutes to complete, which is how the performance degradation becomes noticeable.
 
The server in question is running CentOS 7.5 with kernel 3.10.0-862.3.2 on PPC64.  The RAID array is 114.6 TB and consists of 24x 5.5 TB drives (21 data + 2 parity, plus 1 hot spare).  All drives are directly connected via a SAS enclosure.  The system has 256 GB of RAM.
 
I don’t think these specifics are directly pertinent, as I’ve been able to replicate the increasing memory footprint of the raid6-md0 slab on an x86_64 VM with 1 GB of RAM and a RAID6 consisting of 4x 50 GB VHDs.  The reason I mention them is that there appears to be a linear relationship between the memory usage on my test VM and on the actual bare-metal server, which I’ll include below.
 
Immediately after a reboot, I see the following on my VM:

================================================ 
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  272            272         1864       17            8               :  tunables  0        0             0               :  slabdata  16              16           0
================================================
 
The array is empty – no filesystems, partitions, or anything, so the disks are idle.
 
If I trigger a raid-check manually, and then re-examine the slabinfo:

================================================ 
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  3060           3060        1864       17            8               :  tunables  0        0             0               :  slabdata  180             180          0
================================================
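For reference, raid-check appears to simply write “check” into the md sysfs interface, so a check can also be kicked off directly (guarded here so it’s a no-op on a box without md0):

```shell
# Kick off a consistency check directly via sysfs; as far as I can
# tell, this is what /usr/sbin/raid-check does under the hood.
# Guarded so the snippet is a no-op on a machine without /dev/md0.
action=/sys/block/md0/md/sync_action
if [ -w "$action" ]; then
    echo check > "$action"
    cat "$action"    # reports "check" while running, "idle" when done
else
    echo "md0 not present, skipping"
fi
```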
 
Executing the raid-check a second time, the memory usage increases again:
 
================================================ 
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  4420           4420        1864       17            8               :  tunables  0        0             0               :  slabdata  260             260          0
================================================
 
So, this accounts for the loss of available memory. 
 
Without knowing what’s going on inside the module, my guess at the cause of the performance dip is that the kernel module maintains a list or tree that must be traversed when disk IO is requested, and that this structure grows during the consistency check and is not pruned correctly afterwards.  This is a gut feeling, not based on anything that I've seen in the source.
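One way to probe that hypothesis is to watch the md stripe cache counters in sysfs before and after a check; if the slab growth tracks stripe_cache_active, the growing structure is the stripe cache (guarded so it’s safe on a box without md0):

```shell
# Snapshot the md stripe cache counters from sysfs; if the slab growth
# tracks stripe_cache_active, the growing structure is the stripe
# cache.  Guarded so this is safe on a machine without /dev/md0.
for f in stripe_cache_size stripe_cache_active; do
    p="/sys/block/md0/md/$f"
    if [ -r "$p" ]; then
        echo "$f: $(cat "$p")"
    else
        echo "$f: (md0 not present)"
    fi
done
```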
 
For small arrays, the additional memory consumed each week and the resulting performance hit are both minimal.
For relatively large arrays, like the university has, the memory consumption and the performance hit on long-duration jobs become much more apparent.
 
I also tried turning on slab tracing for the raid6-md0 cache and saw more alloc calls than free calls during the consistency check:

================================================ 
# echo 1 > /sys/kernel/slab/raid6-md0/trace 
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  4060           4068        1728       18            8               :  tunables  0        0             0               :  slabdata  226             226          0 
# grep 'TRACE raid6-md0 alloc' messages | wc -l
1520
# grep 'TRACE raid6-md0 free' messages | wc -l
0
================================================
 
This shows many allocs and no frees during the first check.  I performed another check and saw the active objects climb to 5185 and there were 1015 more allocs and still no frees. 
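For convenience, both tallies can be pulled in one pass (the log path is an assumption; substitute wherever your syslog writes kernel messages):

```shell
# Tally alloc vs free trace events for the raid6-md0 cache in one pass.
# The log path is an assumption (wherever syslog puts kernel messages
# on your system).
log=/var/log/messages
for op in alloc free; do
    count=$(grep -c "TRACE raid6-md0 $op" "$log" 2>/dev/null)
    echo "$op: ${count:-0}"
done
```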
 
================================================
Dec  4 16:44:26 localhost kernel: TRACE raid6-md0 alloc 0xffff8e6ed57b7480 inuse=17 fp=0x          (null)
Dec  4 16:44:26 localhost kernel: CPU: 3 PID: 443 Comm: md0_raid6 Not tainted 3.10.0-862.3.3.el7.x86_64.debug #1
Dec  4 16:44:26 localhost kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Dec  4 16:44:26 localhost kernel: Call Trace:
Dec  4 16:44:26 localhost kernel: [<ffffffff9dbe9181>] dump_stack+0x19/0x1b
Dec  4 16:44:26 localhost kernel: [<ffffffff9dbe5d5b>] alloc_debug_processing+0xc5/0x118
Dec  4 16:44:26 localhost kernel: [<ffffffff9d64380a>] ___slab_alloc+0x53a/0x560
Dec  4 16:44:26 localhost kernel: [<ffffffffc07c249e>] ? alloc_stripe+0x2e/0x190 [raid456]
Dec  4 16:44:26 localhost kernel: [<ffffffff9dbf38d6>] ? _raw_spin_unlock_irqrestore+0x36/0x70
Dec  4 16:44:26 localhost kernel: [<ffffffff9d4745b3>] ? kvm_clock_read+0x33/0x40
Dec  4 16:44:26 localhost kernel: [<ffffffffc07c249e>] ? alloc_stripe+0x2e/0x190 [raid456]
Dec  4 16:44:26 localhost kernel: [<ffffffff9dbe6051>] __slab_alloc+0x46/0x7d
Dec  4 16:44:26 localhost kernel: [<ffffffffc07c249e>] ? alloc_stripe+0x2e/0x190 [raid456]
Dec  4 16:44:26 localhost kernel: [<ffffffff9d643b47>] kmem_cache_alloc+0x317/0x3e0
Dec  4 16:44:26 localhost kernel: [<ffffffffc07c249e>] alloc_stripe+0x2e/0x190 [raid456]
Dec  4 16:44:26 localhost kernel: [<ffffffffc07c766d>] grow_one_stripe+0x2d/0xf0 [raid456]
Dec  4 16:44:26 localhost kernel: [<ffffffffc07d34b6>] raid5d+0x7e6/0x880 [raid456]
Dec  4 16:44:26 localhost kernel: [<ffffffff9d52f73d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
Dec  4 16:44:26 localhost kernel: [<ffffffff9d9d5f6b>] md_thread+0x15b/0x1a0
Dec  4 16:44:26 localhost kernel: [<ffffffff9d4d7880>] ? wake_up_atomic_t+0x30/0x30
Dec  4 16:44:26 localhost kernel: [<ffffffff9d9d5e10>] ? find_pers+0x80/0x80
Dec  4 16:44:26 localhost kernel: [<ffffffff9d4d64cf>] kthread+0xef/0x100
Dec  4 16:44:26 localhost kernel: [<ffffffff9d4d63e0>] ? insert_kthread_work+0x80/0x80
Dec  4 16:44:26 localhost kernel: [<ffffffff9dbff1f7>] ret_from_fork_nospec_begin+0x21/0x21
Dec  4 16:44:26 localhost kernel: [<ffffffff9d4d63e0>] ? insert_kthread_work+0x80/0x80
================================================
 
Looking at the backtrace, it appears that grow_one_stripe(), called from raid5d(), is allocating all of the RAM, while raid5_cache_scan()’s calls to drop_one_stripe() are where it would be freed.
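If drop_one_stripe() is only ever reached from the shrinker, the memory may be reclaimable under pressure rather than truly leaked.  A quick way to test that theory is to force the slab shrinkers to run and see whether active_objs falls back toward the post-boot value (needs root; guarded otherwise):

```shell
# Force the kernel's slab shrinkers to run: "echo 2" asks the kernel
# to reclaim slab objects (presumably including cached stripes).  If
# active_objs falls back toward the post-boot value, the stripes are
# cached-but-reclaimable rather than leaked.  Needs root; guarded so
# it is a no-op elsewhere.
if [ -w /proc/sys/vm/drop_caches ]; then
    sync
    echo 2 > /proc/sys/vm/drop_caches
    grep '^raid6-md0' /proc/slabinfo
else
    echo "not root, skipping"
fi
```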
 
Looking at the objsize and active_objs, I can see that the memory footprint of the raid6-md0 slab cache increases from 507 KB (1864 * 272) after boot to 5.7 MB (1864 * 3060) after the first check, growing to 9.2 MB (1864 * 4931) after the 3rd check.  This agrees with what slabtop reports.
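The same arithmetic (active_objs * objsize) can be pulled straight out of slabinfo; here it is against a sample line copied from the first-check output above:

```shell
# Cache footprint = active_objs (field 2) * objsize (field 4); the
# sample line is copied from the first-check slabinfo output above.
line='raid6-md0  3060  3060  1864  17  8 : tunables 0 0 0 : slabdata 180 180 0'
echo "$line" | awk '{ printf "%.1f MB\n", $2 * $4 / 1e6 }'
```

That prints 5.7 MB, matching the figure above.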
 
If we linearly scale this up 1146x (from the VM’s 100 GB array to 114.6 TB), the first raid-check would consume about 6.5 GB of additional RAM, which is close to the “about 5 GB” of free RAM reported to be lost each week.
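The scale-up is just arithmetic on the numbers above:

```shell
# 3060 stripes * 1864 bytes on the 100 GB VM array, scaled by the
# 1146x capacity ratio (114.6 TB / 100 GB)
awk 'BEGIN { printf "%.1f GB\n", 3060 * 1864 * 1146 / 1e9 }'
```

That prints 6.5 GB.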
 
To see if perhaps this issue was addressed in a newer kernel, I installed kernel 4.19.5-1.el7.elrepo.x86_64 and re-ran my tests immediately after a reboot.
 
I saw the same 272 active objects to start, but the active objects increased much more with this kernel, leading me to believe that the issue has not been resolved, and may actually be exacerbated, in newer kernels:

================================================ 
# uname -r
4.19.5-1.el7.elrepo.x86_64
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  272            272         1856       17            8               :  tunables  0        0             0               :  slabdata  16              16           0
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  5117           5117        1856       17            8               :  tunables  0        0             0               :  slabdata  301             301          0
# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  7089           7089        1856       17            8               :  tunables  0        0             0               :  slabdata  417             417          0
================================================ 

As you can see, after the second consistency check with the newer kernel, the memory consumption is about 13.2 MB (1856 * 7089).
 
Do you have any suggestions on how I can troubleshoot this further? 
 
Thanks!
 
-Rich




