Hi Phil!

On 12/6/18, 08:51, "Phil Turmel" <philip@xxxxxxxxxx> wrote:

    Good morning Richard,

    On 12/5/18 10:05 AM, Richard Alloway wrote:
    > Hi Neil!
    >
    > I got this email address from the Contact page of your neil.brown.name website and hope that you can point me in the right direction.

    In case it isn't obvious, this is the public mailing list for generic Linux Software Raid and related topics, not Neil's personal address. And though he still contributes, he gave up maintainership a few years ago.

Boy! Is my face red! 😉 I apologize to the list members for this faux pas!

    Participants here are generally volunteers with varied experience and platforms, fwiw.

    Speaking of platforms:

    > ================================================
    > Dec 4 16:44:26 localhost kernel: TRACE raid6-md0 alloc 0xffff8e6ed57b7480 inuse=17 fp=0x (null)
    > Dec 4 16:44:26 localhost kernel: CPU: 3 PID: 443 Comm: md0_raid6 Not tainted 3.10.0-862.3.3.el7.x86_64.debug #1
    > Dec 4 16:44:26 localhost kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

    When I first saw this, I thought "Hmmm. a job for RedHat support, not the linux-raid ML".

The system originally exhibiting this issue is a bare-metal CentOS system, so there is no Red Hat support available. My local test system is a CentOS VM running under the current release of VirtualBox, though.

    > ================================================
    > # uname -r
    > 4.19.5-1.el7.elrepo.x86_64

    So some readers may have missed your testing on a fairly current kernel. However, it may still be a RedHat question, as very old commercial software is in the mix (Vbox).

The issue was originally seen with the 3.10.0-862.3.3 CentOS kernel on bare metal (no VirtualBox involved), and I've recreated it with the 3.10.0-862.3.3 CentOS kernel under VirtualBox as well as the 4.19.5-1 ELRepo kernel under VirtualBox. I believe this is sufficient to isolate the issue to the Linux kernel itself rather than to any specific distribution or hardware.

    > Do you have any suggestions on how I can troubleshoot this further?

    Consider setting up a test environment on bare metal with a vanilla kernel and repeating your testing. Reply to this thread to keep it together.

    Or set up an EL7 environment with their latest kernel on bare metal and retest. If still occurring, open a ticket with RedHat and report both there and here (we'll still be interested).

If I can isolate the issue enough that I feel opening a ticket will lead to a solution, I'll definitely do so. In fact, I'd love to get to the point where I can resolve the problem and post a patch at the same time.

    I haven't seen anything in recent kernels that would explain your memory leak, other than a driver losing track of an I/O request.

I don't believe that this bug is due to a recent change. I think it is the magnitude of the RAID that is exposing it.
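(Side note for anyone following along: the TRACE lines quoted above are SLUB's per-cache debug tracing. Below is only a rough sketch of one way to get that kind of output -- the exact flags and the raid6-md0 cache name are assumptions on my part, not necessarily what produced the trace above:)

================================================
# Boot with per-cache SLUB debugging on the kernel command line, e.g.:
#   slub_debug=UT,raid6-md0
# 'T' prints a TRACE line (plus a stack dump) for every alloc/free in that
# cache; 'U' records the call sites, which can later be read back from sysfs:
cat /sys/kernel/slab/raid6-md0/alloc_calls
cat /sys/kernel/slab/raid6-md0/free_calls
================================================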
Standing up a CentOS 6.10 system and a CentOS 7.0 system and retesting produced some interesting results:

CentOS 6.10 (kernel 2.6.32-754.el6.x86_64) does *NOT* exhibit this issue:

================================================
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name         <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md127  256            260         1416       5             2               :  tunables  24       12            8               :  slabdata  52              52           0

# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name         <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md127  256            260         1416       5             2               :  tunables  24       12            8               :  slabdata  52              52           0
================================================

CentOS 7.0 (kernel 3.10.0-123.el7.x86_64) does *NOT* exhibit this issue, either:

================================================
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//'
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0            273    273   1552   21    8 : tunables    0    0    0 : slabdata     13     13      0

# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  273            273         1552       21            8               :  tunables  0        0             0               :  slabdata  13              13           0
================================================

This means that somewhere between 3.10.0-123 and 3.10.0-862.3.3, a change was introduced that causes this issue.

I guess my primary asks here are to find out whether this is a known issue (with a known resolution, ideally) and/or how to debug the memory allocation specifically within the software RAID kernel modules.

In the meantime, I'll be working to isolate the kernel patch that results in the additional memory allocation (or prevents allocated memory from being freed).

Thanks!

-Rich

    (FWIW, I'm not a kernel developer -- just a power user -- though I do have substantial non-kernel C experience.)

    Phil
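P.S. On the "how do I debug the allocations" part of my own question, this is the rough sampling loop I'm planning to use while narrowing things down -- just a sketch, assuming the array is md0 and its stripe cache shows up in /proc/slabinfo as raid6-md0:

================================================
# Start a check on the (assumed) array md0, then sample the raid6 slab and
# md's own stripe-cache counters until the check finishes.
echo check > /sys/block/md0/md/sync_action
while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do
    # /proc/slabinfo columns: name <active_objs> <num_objs> ...
    awk '/^raid6-/ { printf "%s active=%s total=%s\n", $1, $2, $3 }' /proc/slabinfo
    echo "stripe_cache_active=$(cat /sys/block/md0/md/stripe_cache_active)" \
         "stripe_cache_size=$(cat /sys/block/md0/md/stripe_cache_size)"
    sleep 10
done
================================================

The idea is just to see whether active_objs in the raid6 slab keeps growing during the check and whether it ever drops back once the check goes idle.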