Hi Phil!

On 12/6/18, 08:51, "Phil Turmel" <philip@xxxxxxxxxx> wrote:

    Good morning Richard,

    On 12/5/18 10:05 AM, Richard Alloway wrote:
    > Hi Neil!
    >
    > I got this email address from the Contact page of your neil.brown.name website and hope that you can point me in the right direction.

    In case it isn't obvious, this is the public mailing list for generic Linux Software Raid and related topics, not Neil's personal address. And though he still contributes, he gave up maintainership a few years ago.

Boy! Is my face red! 😉 I apologize to the list members for this faux pas!

    Participants here are generally volunteers with varied experience and platforms, fwiw.

    Speaking of platforms:

    > ================================================
    > Dec 4 16:44:26 localhost kernel: TRACE raid6-md0 alloc 0xffff8e6ed57b7480 inuse=17 fp=0x (null)
    > Dec 4 16:44:26 localhost kernel: CPU: 3 PID: 443 Comm: md0_raid6 Not tainted 3.10.0-862.3.3.el7.x86_64.debug #1
    > Dec 4 16:44:26 localhost kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

    When I first saw this, I thought "Hmmm. a job for RedHat support, not the linux-raid ML".

The system originally exhibiting this issue is a bare-metal CentOS system, so there is no Red Hat support available. My local test system is a CentOS VM running under the current release of VirtualBox, though.

    > ================================================
    > # uname -r
    > 4.19.5-1.el7.elrepo.x86_64

    So some readers may have missed your testing on a fairly current kernel. However, it may still be a RedHat question, as very old commercial software is in the mix (Vbox).

The issue was originally seen with the 3.10.0-862.3.3 CentOS kernel on bare metal (no VirtualBox involved), and I've recreated it with the 3.10.0-862.3.3 CentOS kernel under VirtualBox as well as the 4.19.5-1 ELRepo kernel under VirtualBox. I believe this is sufficient to isolate the issue to the Linux kernel itself rather than to any specific distribution or hardware.

    > Do you have any suggestions on how I can troubleshoot this further?

    Consider setting up a test environment on bare metal with a vanilla kernel and repeating your testing. Reply to this thread to keep it together.

    Or set up an EL7 environment with their latest kernel on bare metal and retest. If still occurring, open a ticket with RedHat and report both there and here (we'll still be interested).

If I can isolate the issue enough that I feel opening a ticket will lead to a solution, I'll definitely do so. In fact, I'd love to get to the point where I can resolve the problem and post a patch at the same time.

    I haven't seen anything in recent kernels that would explain your memory leak, other than a driver losing track of an I/O request.

I don't believe that this bug is due to a recent change. I think it is the magnitude of the RAID that is exposing it.
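(Side note for anyone following along: the TRACE lines quoted above are SLUB's per-cache debug tracing. Below is only a rough sketch of one way to get that kind of output -- the exact flags and the raid6-md0 cache name are assumptions on my part, not necessarily what produced the trace above:)

================================================
# Boot with per-cache SLUB debugging on the kernel command line, e.g.:
#   slub_debug=UT,raid6-md0
# 'T' prints a TRACE line (plus a stack dump) for every alloc/free in that
# cache; 'U' records the call sites, which can later be read back from sysfs:
cat /sys/kernel/slab/raid6-md0/alloc_calls
cat /sys/kernel/slab/raid6-md0/free_calls
================================================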
Standing up a CentOS 6.10 system and a CentOS 7.0 system and retesting produced some interesting results:

CentOS 6.10 (kernel 2.6.32-754.el6.x86_64) does *NOT* exhibit this issue:

================================================
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name         <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md127  256            260         1416       5             2               :  tunables  24       12            8               :  slabdata  52              52           0

# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name         <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md127  256            260         1416       5             2               :  tunables  24       12            8               :  slabdata  52              52           0
================================================

CentOS 7.0 (kernel 3.10.0-123.el7.x86_64) does *NOT* exhibit this issue, either:

================================================
# egrep '^#|raid' /proc/slabinfo | sed 's/^#//'
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
raid6-md0            273    273   1552   21    8 : tunables    0    0    0 : slabdata     13     13      0

# /usr/sbin/raid-check ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t
name       <active_objs>  <num_objs>  <objsize>  <objperslab>  <pagesperslab>  :  tunables  <limit>  <batchcount>  <sharedfactor>  :  slabdata  <active_slabs>  <num_slabs>  <sharedavail>
raid6-md0  273            273         1552       21            8               :  tunables  0        0             0               :  slabdata  13              13           0
================================================

This means that somewhere between 3.10.0-123 and 3.10.0-862.3.3, a change was introduced that causes this issue.

I guess my primary asks here are to find out whether this is a known issue (with a known resolution, ideally) and/or how to debug the memory allocation specifically within the software RAID kernel modules.

In the meantime, I'll be working to isolate the kernel patch that results in the additional memory allocation (or prevents allocated memory from being freed).

Thanks!

-Rich

    (FWIW, I'm not a kernel developer -- just a power user -- though I do have substantial non-kernel C experience.)

    Phil
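P.S. On the "how do I debug the allocations" part of my own question, this is the rough sampling loop I'm planning to use while narrowing things down -- just a sketch, assuming the array is md0 and its stripe cache shows up in /proc/slabinfo as raid6-md0:

================================================
# Start a check on the (assumed) array md0, then sample the raid6 slab and
# md's own stripe-cache counters until the check finishes.
echo check > /sys/block/md0/md/sync_action
while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do
    # /proc/slabinfo columns: name <active_objs> <num_objs> ...
    awk '/^raid6-/ { printf "%s active=%s total=%s\n", $1, $2, $3 }' /proc/slabinfo
    echo "stripe_cache_active=$(cat /sys/block/md0/md/stripe_cache_active)" \
         "stripe_cache_size=$(cat /sys/block/md0/md/stripe_cache_size)"
    sleep 10
done
================================================

The idea is just to see whether active_objs in the raid6 slab keeps growing during the check and whether it ever drops back once the check goes idle.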