Re: Software RAID memory issue?

On 12/11/18, 20:03, "NeilBrown" <neilb@xxxxxxxx> wrote:

<snipping a lot of the earlier conversation to save space>

    > Dropping the caches did reduce the active objects considerably, but not so much the total number of objects.
    
    I think this is an important observation.
    
    The value reported by stripe_cache_active isn't really the value I
    wanted. I wanted conf->max_nr_stripes, but you cannot get that out of
    md.
    It should be exactly the "active_objs" from the slab, and I suspect it
    is.
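
In case it helps anyone reproduce this, the two numbers are easy enough to put side by side.  A rough sketch (it assumes the array is md0 and the cache shows up as raid6-md0, as it does here; reading /proc/slabinfo usually needs root):

/* slabcheck.c - compare md's stripe_cache_active with the slab's active_objs.
 * Sketch only: "md0" and "raid6-md0" are the names on this system; adjust to
 * taste.  Reading /proc/slabinfo typically requires root.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	FILE *f;

	/* what md itself reports */
	f = fopen("/sys/block/md0/md/stripe_cache_active", "r");
	if (f && fgets(line, sizeof(line), f))
		printf("stripe_cache_active: %s", line);
	if (f)
		fclose(f);

	/* active_objs / num_objs / objsize / objperslab / pagesperslab */
	f = fopen("/proc/slabinfo", "r");
	if (!f) {
		perror("/proc/slabinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned long active, num, objsize, objperslab, pagesperslab;
		char name[64];

		if (sscanf(line, "%63s %lu %lu %lu %lu %lu",
			   name, &active, &num, &objsize,
			   &objperslab, &pagesperslab) == 6 &&
		    strcmp(name, "raid6-md0") == 0)
			printf("%s: active=%lu total=%lu objsize=%lu "
			       "objs/slab=%lu pages/slab=%lu\n",
			       name, active, num, objsize,
			       objperslab, pagesperslab);
	}
	fclose(f);
	return 0;
}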
    
    The objects are big - I didn't realise how big that had become. 1696
    bytes with only 4 devices.  Imagine how big they would be with 16
    devices!
    The allocator chooses to use 8-page slabs, which will put pressure on
    anything else that needs large allocations.
    And when memory pressure causes md to free some, it doesn't try to free
    all allocations in a given slab, so it doesn't free as many slabs as it
    should.

I've actually got the slabinfo data from the array with 23 devices.  The object size is 8872 bytes there.
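
That tracks with how the cache objects are sized: each one is a stripe_head plus one r5dev per member device, so the object grows linearly with the number of devices.  Roughly, from raid5.c's grow_stripes() (quoting from memory, so the exact context may differ between kernel versions):

	/* One r5dev per member device is folded into each cache object (the
	 * dev[1] array at the tail of struct stripe_head is allocated with
	 * extra space), which is why 1696 bytes at 4 devices becomes 8872 at
	 * 23.  The cache name is "raid%d-%s", hence raid6-md0 in slabinfo. */
	int devs = max(conf->raid_disks, conf->previous_raid_disks);

	sc = kmem_cache_create(conf->cache_name[conf->active_name],
			       sizeof(struct stripe_head) +
			       (devs - 1) * sizeof(struct r5dev),
			       0, 0, NULL);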
    
    In your example we ended up with about 5 active objects per slab, when
    19 fit.  So 60% of the space is wasted.  That is 85Meg wasted.

Ouch!
    
    This shouldn't get worse on subsequent checks, but it isn't good.
    The fact that your experiments suggest it does get worse could be a
    result of the fact that it allocates more memory easily, but doesn't
    release it very effectively.

I ran some tests before the holiday break and analyzed the data this morning.  It appears that the first consistency check leaves just over 7 MB of RAM allocated, and that each subsequent check allocates an additional 1.5-2.1 MB of RAM.

After boot, but before any checks, raid6-md0 uses about 466 KB of RAM (stock kernel).  That grows to about 7.5 MB after the first check, about 9.4 MB after the second, and about 10.9 MB after the third; if I then drop caches, raid6-md0 falls back to about 8 MB... still more than after the first check.

raid6-md0 actually consumes about 3.1 MB of additional RAM per check, on average (based on 4 checks).  Since the first check consumes the most RAM, excluding it gives an average of 1.8 MB of additional RAM per check for checks 2 through 4.
    
    One possible approach to this problem would be to make the allocations
    smaller - either make the data structures smaller, or allocate the array
    of  'struct r5dev' separately.
    I doubt we could make it much smaller, so I don't think this would
    help a lot.
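
If it helps to picture the second option: the r5dev array currently lives at the tail of struct stripe_head (the dev[1] member is allocated with extra space per device), so "allocate separately" would mean a fixed-size stripe_head pointing at its own array.  Purely a sketch of the idea, not a patch:

	struct stripe_head {
		/* ... existing fields unchanged ... */
		struct r5dev *dev;	/* kcalloc(devs, sizeof(*sh->dev), gfp)
					 * done once per stripe_head at grow time */
	};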
    
    The other approach is to find a way to free whole slabs.  This would
    require md/raid5 to have some visibility into the slab system to know
    which stripe_heads were in the same slab.  That is not easy,
    particularly as there are 3 different slab allocators which would need to
    be understood.
    If we could sort the free stripe_heads by address and free them in that
    order it might come close, but that isn't easy either.
    To minimize lock contention, we keep the inactive stripe_heads in 8
    different free lists and we would need to sort all the lists together.
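
Just to make the sort-by-address idea concrete: objects carved from the same slab sit next to each other in memory, so releasing them in pointer order groups the frees per slab and gives whole slabs a chance to empty out.  A userspace toy to illustrate (nothing md-specific about it):

/* sortfree.c - free a pile of objects in address order, which is the effect
 * the sort would have on the inactive stripe_heads.  Illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int by_address(const void *a, const void *b)
{
	uintptr_t pa = (uintptr_t)*(void * const *)a;
	uintptr_t pb = (uintptr_t)*(void * const *)b;

	return (pa > pb) - (pa < pb);
}

int main(void)
{
	enum { N = 8 };
	void *objs[N];
	int i;

	/* stand-ins for inactive stripe_heads pulled off the free lists */
	for (i = 0; i < N; i++)
		objs[i] = malloc(1696);

	qsort(objs, N, sizeof(objs[0]), by_address);

	for (i = 0; i < N; i++) {
		printf("freeing %p\n", objs[i]);
		free(objs[i]);
	}
	return 0;
}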
    
    Actually, the multiple lists cause another problem.  Some of the code
    assumes that these lists are much the same size.  Each stripe_head is
    assigned to a list (->hash_lock_index) when created, and is always put
    back on the same list.
    When freeing stripe_heads, we strictly rotate around the different
    lists.  If one free list is empty (because all the stripe_heads with
    that index are in use), then we stop freeing, even if the other lists
    have many entries.
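
If I'm reading that right, the shrink path behaves conceptually like this (hypothetical pseudocode of the behaviour being described, not the actual raid5.c code -- the helper names are made up):

	/* Frees walk the hash free lists in strict rotation, and the whole
	 * shrink stops as soon as the list whose turn it is comes up empty,
	 * even if the other lists still hold plenty of inactive stripe_heads. */
	while (need_to_free_more()) {
		int idx = next_index_in_rotation();	/* hypothetical helpers */

		if (list_empty(&free_list[idx]))
			break;		/* stop here; other lists are ignored */
		free_one_stripe(&free_list[idx]);
	}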
    
    Unfortunately I cannot think of any easy way to fix this.  It really
    requires someone to think carefully about how these stripe_heads are
    allocated and freed, and to find a new approach that addresses the
    issues.
    
    A quick-and-dirty hack might be to change both kmem_cache_create()
    calls in raid5.c to use roundup_pow_of_two() on their second argument.
    That would waste some memory, but should cause it to use the same slabs
    that kmalloc() uses, so that there is more chance that memory freed by
    raid5 can be used by something else.
    It would also mean it would use 1-page slabs instead of 8-page slabs.
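
(Concretely, I read that as wrapping the size argument, something like this -- a sketch against the grow_stripes() call site quoted earlier; resize_stripes() has the matching call to change, and the exact context varies a little by kernel version:)

	sc = kmem_cache_create(conf->cache_name[conf->active_name],
			       roundup_pow_of_two(sizeof(struct stripe_head) +
						  (devs - 1) * sizeof(struct r5dev)),
			       0, 0, NULL);
	/* 1696 bytes rounds up to 2048 here, so the objects pack into the same
	 * size class the generic kmalloc caches use, as suggested above. */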

This being an easy thing to change and test, I did this and ran the exact same set of tests (also before the holidays).

Now, instead of the first consistency check leaving ~7 MB of RAM allocated and further checks allocating ~1.8 MB more, each check leaves approximately 30 KB of additional RAM allocated!  This is only approximate, though, since I am counting all of the single-page kmalloc slabs and cannot isolate the raid6-mdX slabs.  It's possible that no additional RAM is left allocated after each check with this modification.

Since the wasted memory seems to be much less and the retained allocations disappear, what would the negatives be, if any, of running with this change on an ongoing basis?  Besides the overhead of keeping a patched kernel up to date, of course.

Thanks!

-Rich
    

    NeilBrown 




