On Mon, Dec 10 2018, Richard Alloway wrote:

> On 12/9/18, 23:32, "NeilBrown" <neilb@xxxxxxxx> wrote:
>
> This is useful information, thanks.
>
> Can you repeat the experiment and also check the value in
> /sys/block/md0/md/stripe_cache_active
>
> Hi Neil!
>
> Thanks for the response and the additional troubleshooting steps!
>
> Here is the result of checking /sys/block/md0/md/stripe_cache_active before, during and after the consistency check (Tested against CentOS kernel 3.10.0-514.el7.x86_64, which is the first one that exhibits this behavior):
>
> Before consistency check:
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 266           266        1696      19           8              : tunables 0       0            0              : slabdata 14             14          0
>
> /sys/block/md0/md/stripe_cache_active: 0
> ================================================
>
>
> During consistency check:
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>       [=>...................] check = 6.3% (3311484/52395520) finish=19.7min speed=41438K/sec
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 1387          1387       1696      19           8              : tunables 0       0            0              : slabdata 73             73          0
>
> /sys/block/md0/md/stripe_cache_active: 1320
> ================================================
>
>
> After consistency check:
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 4522          4522       1696      19           8              : tunables 0       0            0              : slabdata 238            238         0
>
> /sys/block/md0/md/stripe_cache_active: 0
> ================================================
>
> This number can grow large, but should shrink again when there is memory pressure, but maybe that isn't happening.
>
> If stripe_cache_active has a similar value to slabinfo, then memory isn't getting lost, but the shrinker isn't working.
> If it has a much smaller value then memory is getting lost.
>
> Before and after the consistency check, the value is zero. During the consistency check, it does grow, similarly to what is in slabinfo, but when it drops afterwards, the slabinfo remains high.
>
> If it appears to be the former, try to stop the check, then
> echo 3 > /proc/sys/vm/drop_caches
>
> that should aggressively flush lots of caches, including the stripe cache.
>
> Even though stripe_cache_active dropped, I thought the output after dropping the caches may be helpful:
>
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 4522          4522       1696      19           8              : tunables 0       0            0              : slabdata 238            238         0
>
> /sys/block/md0/md/stripe_cache_active: 0
> # echo 3 > /proc/sys/vm/drop_caches
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 988           4446       1696      19           8              : tunables 0       0            0              : slabdata 234            234         0
>
> /sys/block/md0/md/stripe_cache_active: 0
> ================================================
>
> Dropping the caches did reduce the active objects considerably, but not so much the total number of objects.

I think this is an important observation. The value reported by stripe_cache_active isn't really the value I wanted. I wanted conf->max_nr_stripes, but you cannot get that out of md. It should be exactly the "active_objs" from the slab, and I suspect it is.

The objects are big - I didn't realise how big that had become. 1696 bytes with only 4 devices. Imagine how big they would be with 16 devices! The allocator chooses to use 8-page slabs, which will put pressure on anything else that needs large allocations (see the worked numbers below). And when memory pressure causes md to free some, it doesn't try to free all allocations in a given slab, so it doesn't free as many slabs as it should. In your example we ended up with about 5 active objects per slab, when 19 fit. So 60% of the space is wasted. That is 85Meg wasted.

This shouldn't get worse on subsequent checks, but it isn't good. The fact that your experiments suggest it does get worse could be because it allocates more memory easily, but doesn't release it very effectively.

One possible approach to this problem would be to make the allocations smaller - either make the data structures smaller, or allocate the array of 'struct r5dev' separately. I doubt we could make it much smaller, so I don't think this would help a lot.

The other approach is to find a way to free whole slabs. This would require md/raid5 to have some visibility into the slab system to know which stripe_heads were in the same slab. That is not easy, particularly as there are 3 different slab allocators which would need to be understood.
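To make the 8-page-slab point above concrete, here is the arithmetic worked from the slabinfo lines quoted earlier (objsize 1696, objperslab 19, pagesperslab 8). This is only an illustrative userspace calculation, assuming 4 KiB pages; it is not code from raid5.c:

#include <stdio.h>

/*
 * Illustrative only: reproduce the slab geometry reported by slabinfo above
 * (objsize 1696, objperslab 19, pagesperslab 8), assuming 4 KiB pages.
 */
int main(void)
{
	const long objsize = 1696;		/* size of one stripe_head object */
	const long page    = 4096;
	const long slab8   = 8 * page;		/* one 8-page slab = 32768 bytes  */

	printf("objects per 8-page slab: %ld, leftover %ld bytes\n",
	       slab8 / objsize, slab8 % objsize);	/* 19, 544 */
	printf("objects per 1-page slab: %ld, leftover %ld bytes\n",
	       page / objsize, page % objsize);		/* 2, 704  */
	return 0;
}

With 1696-byte objects a single page would hold only 2 of them and waste 704 bytes, while an 8-page slab holds 19 and wastes just 544, which is presumably why the allocator picks the larger order - and why every partially-used slab pins a full 32 KiB of contiguous memory.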
If we could sort the free stripe_heads by address and free them in that order it might come close, but that isn't easy either. To minimize lock contention, we keep the inactive stripe_heads in 8 different free lists and we would need to sort all the lists together.

Actually, the multiple lists cause another problem. Some of the code assumes that these lists are much the same size. Each stripe_head is assigned to a list (->hash_lock_index) when created, and is always put back on the same list. When freeing stripe_heads, we strictly rotate around the different lists. If one free list is empty (because all the stripe_heads with that index are in use), then we stop freeing, even if the other lists have many entries.

Unfortunately I cannot think of any easy way to fix this. It really requires someone to think carefully about how these stripe_heads are allocated and freed, and to find a new approach that addresses the issues.

A quick-and-dirty hack might be to change both kmem_cache_create() calls in raid5.c to use roundup_pow_of_two() on their second argument (a rough sketch of that change is at the bottom of this mail). That would waste some memory, but should cause it to use the same slabs that kmalloc() uses, so that there is more chance that memory freed by raid5 can be used by something else. It would also mean it would use 1-page slabs instead of 8-page slabs.

NeilBrown

>
> Going back to the "this kernel has the issue, this kernel doesn't" investigation that I've been doing, the newest CentOS 7.2 kernel (3.10.0-327.36.3.el7) doesn't have this issue, but consistency checks take quite a bit longer, while the initial CentOS 7.3 kernel (3.10.0-514.el7) does.
>
> A diff on the two kernels shows 160 changelog entries referencing 25 unique Red Hat Bugzilla tickets credited to Heinz Mauelshagen, Jes Sorensen and Mike Snitzer. Trying to track down the changes for each is proving a bit difficult as the changes that Red Hat puts into their kernels can be backports of fixes from newer official/upstream kernels or fixes that have not yet been merged into the upstream kernel.
>
> As luck would have it, Red Hat just updated their Bugzilla and I can no longer log in, so I can't even open a new issue until I get my access resolved.
>
> I know that the Red Hat releases would likely need to be investigated by Red Hat themselves since they are the ones patching the kernels that they release, but the patch(es) that are responsible for this issue, regardless of where they came from, must have been merged with the official kernel at some point since the issue is present in the ELRepo 4.19.5-1.el7 kernel. (The ELRepo kernels being builds of unpatched source from kernel.org.)
>
> I guess I'll start testing vanilla kernels directly from kernel.org to find out which upstream kernel first exhibited this behavior.
>
> Thanks again!
>
> -Rich
>
>
> NeilBrown
>
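For reference, the quick-and-dirty roundup_pow_of_two() hack mentioned above might look roughly like the sketch below. It is untested and written from memory: the exact shape of the kmem_cache_create() calls in drivers/md/raid5.c (one in grow_stripes(), one in resize_stripes()) is an assumption here, not a quote of the current source.

/*
 * Untested sketch only. Assuming the stripe_head cache is created roughly
 * like this in grow_stripes() (and similarly in resize_stripes()):
 *
 *	sc = kmem_cache_create(conf->cache_name[conf->active_name],
 *			       sizeof(struct stripe_head) +
 *			       (devs - 1) * sizeof(struct r5dev),
 *			       0, 0, NULL);
 *
 * the hack is simply to round the size argument up to a power of two, so
 * the cache can share size classes with kmalloc():
 */

	/* roundup_pow_of_two() is declared in <linux/log2.h> */
	sc = kmem_cache_create(conf->cache_name[conf->active_name],
			       roundup_pow_of_two(sizeof(struct stripe_head) +
						  (devs - 1) * sizeof(struct r5dev)),
			       0, 0, NULL);

With the 1696-byte objects seen in this thread, that rounds the allocation up to 2048 bytes, i.e. roughly 350 bytes of waste per stripe_head - the "waste some memory" trade-off described above.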