On Mon, Dec 10 2018, Richard Alloway wrote:

> On 12/9/18, 23:32, "NeilBrown" <neilb@xxxxxxxx> wrote:
>
> This is useful information, thanks.
>
> Can you repeat the experiment and also check the value in
> /sys/block/md0/md/stripe_cache_active
>
> Hi Neil!
>
> Thanks for the response and the additional troubleshooting steps!
>
> Here is the result of checking /sys/block/md0/md/stripe_cache_active before, during and after the consistency check (Tested against CentOS kernel 3.10.0-514.el7.x86_64, which is the first one that exhibits this behavior):
>
> Before consistency check:
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 266           266        1696      19           8              : tunables 0       0            0              : slabdata 14             14          0
>
> /sys/block/md0/md/stripe_cache_active: 0
> ================================================
>
>
> During consistency check:
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>       [=>...................] check = 6.3% (3311484/52395520) finish=19.7min speed=41438K/sec
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 1387          1387       1696      19           8              : tunables 0       0            0              : slabdata 73             73          0
>
> /sys/block/md0/md/stripe_cache_active: 1320
> ================================================
>
>
> After consistency check:
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 4522          4522       1696      19           8              : tunables 0       0            0              : slabdata 238            238         0
>
> /sys/block/md0/md/stripe_cache_active: 0
> ================================================
>
> This number can grow large, but should shrink again when there is memory pressure, but maybe that isn't happening.
>
> If stripe_cache_active has a similar value to slabinfo, then memory isn't getting lost, but the shrinker isn't working.
> If it has a much smaller value then memory is getting lost.
>
> Before and after the consistency check, the value is zero. During the consistency check, it does grow, similarly to what is in slabinfo, but when it drops afterwards, the slabinfo remains high.
>
> If it appears to be the former, try to stop the check, then
> echo 3 > /proc/sys/vm/drop_caches
>
> that should aggressively flush lots of caches, including the stripe cache.
>
> Even though stripe_cache_active dropped, I thought the output after dropping the caches may be helpful:
>
> ================================================
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 4522          4522       1696      19           8              : tunables 0       0            0              : slabdata 238            238         0
>
> /sys/block/md0/md/stripe_cache_active: 0
> # echo 3 > /proc/sys/vm/drop_caches
> # cat /proc/mdstat ; echo ; egrep '^#|raid' /proc/slabinfo | sed 's/^#//' | column -t ; echo -e "\n/sys/block/md0/md/stripe_cache_active: $(cat /sys/block/md0/md/stripe_cache_active)"
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde[3] sdd[2] sdb[0] sdc[1]
>       104791040 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> raid6-md0 988           4446       1696      19           8              : tunables 0       0            0              : slabdata 234            234         0
>
> /sys/block/md0/md/stripe_cache_active: 0
> ================================================
>
> Dropping the caches did reduce the active objects considerably, but not so much the total number of objects.

I think this is an important observation. The value reported by stripe_cache_active isn't really the value I wanted. I wanted conf->max_nr_stripes, but you cannot get that out of md. It should be exactly the "active_objs" from the slab, and I suspect it is.

The objects are big - I didn't realise how big that had become. 1696 bytes with only 4 devices. Imagine how big they would be with 16 devices! The allocator chooses to use 8-page slabs, which will put pressure on anything else that needs large allocations (see the worked numbers below). And when memory pressure causes md to free some, it doesn't try to free all allocations in a given slab, so it doesn't free as many slabs as it should. In your example we ended up with about 5 active objects per slab, when 19 fit. So 60% of the space is wasted. That is 85Meg wasted.

This shouldn't get worse on subsequent checks, but it isn't good. The fact that your experiments suggest it does get worse could be because it allocates more memory easily, but doesn't release it very effectively.

One possible approach to this problem would be to make the allocations smaller - either make the data structures smaller, or allocate the array of 'struct r5dev' separately. I doubt we could make it much smaller, so I don't think this would help a lot.

The other approach is to find a way to free whole slabs. This would require md/raid5 to have some visibility into the slab system to know which stripe_heads were in the same slab. That is not easy, particularly as there are 3 different slab allocators which would need to be understood.
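To make the 8-page-slab point above concrete, here is the arithmetic worked from the slabinfo lines quoted earlier (objsize 1696, objperslab 19, pagesperslab 8). This is only an illustrative userspace calculation, assuming 4 KiB pages; it is not code from raid5.c:

#include <stdio.h>

/*
 * Illustrative only: reproduce the slab geometry reported by slabinfo above
 * (objsize 1696, objperslab 19, pagesperslab 8), assuming 4 KiB pages.
 */
int main(void)
{
	const long objsize = 1696;		/* size of one stripe_head object */
	const long page    = 4096;
	const long slab8   = 8 * page;		/* one 8-page slab = 32768 bytes  */

	printf("objects per 8-page slab: %ld, leftover %ld bytes\n",
	       slab8 / objsize, slab8 % objsize);	/* 19, 544 */
	printf("objects per 1-page slab: %ld, leftover %ld bytes\n",
	       page / objsize, page % objsize);		/* 2, 704  */
	return 0;
}

With 1696-byte objects a single page would hold only 2 of them and waste 704 bytes, while an 8-page slab holds 19 and wastes just 544, which is presumably why the allocator picks the larger order - and why every partially-used slab pins a full 32 KiB of contiguous memory.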
If we could sort the free stripe_heads by address and free them in that order it might come close, but that isn't easy either. To minimize lock contention, we keep the inactive stripe_heads in 8 different free lists and we would need to sort all the lists together.

Actually, the multiple lists cause another problem. Some of the code assumes that these lists are much the same size. Each stripe_head is assigned to a list (->hash_lock_index) when created, and is always put back on the same list. When freeing stripe_heads, we strictly rotate around the different lists. If one free list is empty (because all the stripe_heads with that index are in use), then we stop freeing, even if the other lists have many entries.

Unfortunately I cannot think of any easy way to fix this. It really requires someone to think carefully about how these stripe_heads are allocated and freed, and to find a new approach that addresses the issues.

A quick-and-dirty hack might be to change both kmem_cache_create() calls in raid5.c to use roundup_pow_of_two() on their second argument (a rough sketch of that change is at the bottom of this mail). That would waste some memory, but should cause it to use the same slabs that kmalloc() uses, so that there is more chance that memory freed by raid5 can be used by something else. It would also mean it would use 1-page slabs instead of 8-page slabs.

NeilBrown

>
> Going back to the "this kernel has the issue, this kernel doesn't" investigation that I've been doing, the newest CentOS 7.2 kernel (3.10.0-327.36.3.el7) doesn't have this issue, but consistency checks take quite a bit longer, while the initial CentOS 7.3 kernel (3.10.0-514.el7) does.
>
> A diff on the two kernels shows 160 changelog entries referencing 25 unique Red Hat Bugzilla tickets credited to Heinz Mauelshagen, Jes Sorensen and Mike Snitzer. Trying to track down the changes for each is proving a bit difficult as the changes that Red Hat puts into their kernels can be backports of fixes from newer official/upstream kernels or fixes that have not yet been merged into the upstream kernel.
>
> As luck would have it, Red Hat just updated their Bugzilla and I can no longer log in, so I can't even open a new issue until I get my access resolved.
>
> I know that the Red Hat releases would likely need to be investigated by Red Hat themselves since they are the ones patching the kernels that they release, but the patch(es) that are responsible for this issue, regardless of where they came from, must have been merged with the official kernel at some point since the issue is present in the ELRepo 4.19.5-1.el7 kernel. (The ELRepo kernels being builds of unpatched source from kernel.org.)
>
> I guess I'll start testing vanilla kernels directly from kernel.org to find out which upstream kernel first exhibited this behavior.
>
> Thanks again!
>
> -Rich
>
>
> NeilBrown
>
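For reference, the quick-and-dirty roundup_pow_of_two() hack mentioned above might look roughly like the sketch below. It is untested and written from memory: the exact shape of the kmem_cache_create() calls in drivers/md/raid5.c (one in grow_stripes(), one in resize_stripes()) is an assumption here, not a quote of the current source.

/*
 * Untested sketch only. Assuming the stripe_head cache is created roughly
 * like this in grow_stripes() (and similarly in resize_stripes()):
 *
 *	sc = kmem_cache_create(conf->cache_name[conf->active_name],
 *			       sizeof(struct stripe_head) +
 *			       (devs - 1) * sizeof(struct r5dev),
 *			       0, 0, NULL);
 *
 * the hack is simply to round the size argument up to a power of two, so
 * the cache can share size classes with kmalloc():
 */

	/* roundup_pow_of_two() is declared in <linux/log2.h> */
	sc = kmem_cache_create(conf->cache_name[conf->active_name],
			       roundup_pow_of_two(sizeof(struct stripe_head) +
						  (devs - 1) * sizeof(struct r5dev)),
			       0, 0, NULL);

With the 1696-byte objects seen in this thread, that rounds the allocation up to 2048 bytes, i.e. roughly 350 bytes of waste per stripe_head - the "waste some memory" trade-off described above.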