I created the issue here: http://tracker.ceph.com/issues/8648

Cheers,
- Milosz

On Mon, Jun 23, 2014 at 3:44 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> Okay, this might just be as simple as us creating a new root inode
> without deallocating the old one (MDS::open_root_inode, called from
> MDS::boot_start). Can you create a ticket?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Mon, Jun 23, 2014 at 3:31 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>> standby-replay
>>
>> On Mon, Jun 23, 2014 at 3:27 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>> Ah, excellent. What standby modes are you using?
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Jun 23, 2014 at 2:54 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>> I've spent more time looking at this over the long time frame (since my
>>>> last email in April) and I think I'm closer to understanding what's
>>>> going on here. I believe I was wrong in my original assumption that this
>>>> is caused by tcmalloc, since I tried this without tcmalloc (using glibc)
>>>> and it was still exhibiting the same behavior.
>>>>
>>>> Having said that, I think I've come onto a hint of what might be wrong.
>>>> When doing a version upgrade, my MDS servers' primary / standby roles
>>>> switched... and now the other mds server that was never running into MDS
>>>> OOM scenarios has started doing it, and the one that was having the
>>>> issue stopped. I ended up swapping the standby a couple of times and it
>>>> looks like it's the standby code that's causing this leak.
>>>>
>>>> TL;DR: Standby is the one with the leak... not sure what it is, but the
>>>> primary doesn't exhibit this behavior.
>>>>
>>>> Best
>>>> - Milosz
>>>>
>>>>
>>>> On Mon, Apr 14, 2014 at 3:11 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>>>
>>>>> Sorry for not including the list on the last email. It was an accident.
>>>>>
>>>>> On Fri, Apr 11, 2014 at 6:23 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>> > On Fri, Apr 11, 2014 at 11:07 AM, Milosz Tanski <milosz@xxxxxxxxx>
>>>>> > wrote:
>>>>> >> On Fri, Apr 11, 2014 at 1:07 PM, Gregory Farnum <greg@xxxxxxxxxxx>
>>>>> >> wrote:
>>>>> >>> On Fri, Apr 11, 2014 at 8:59 AM, Milosz Tanski <milosz@xxxxxxxxx>
>>>>> >>> wrote:
>>>>> >>>> I'd like to restart this debate about tcmalloc slow leaks in MDS.
>>>>> >>>> This time around I have some charts. Looking at OSDs and MONs, it
>>>>> >>>> doesn't seem to affect those (as much).
>>>>> >>>>
>>>>> >>>> Here's the chart: http://i.imgur.com/xMCINAD.png The first two
>>>>> >>>> humps are the latest stable MDS version with tcmalloc until the MDS
>>>>> >>>> gets killed by the OOM killer. The last restart is an MDS build of
>>>>> >>>> the same git tag without tcmalloc linked into it.
>>>>> >>>
>>>>> >>> That's interesting, but your graph cuts off before we can really see
>>>>> >>> the long-term behavior of the no-tcmalloc case. :) What's the
>>>>> >>> longer-term pattern look like?
>>>>> >>
>>>>> >> I'm only about two weeks into running without the allocator. I'm
>>>>> >> going to continue running it and report back in two weeks and a
>>>>> >> month. Sadly it takes a long time to test / reproduce the issue.
>>>>> >
>>>>> > Hmm, that makes it sound to me like it's not a tcmalloc issue, but
>>>>> > something changing in MDS state (a new workload that loads too much
>>>>> > into memory or something).
>>>>>
>>>>> 13 days into the last startup so far and the needle hasn't moved on
>>>>> memory usage (stable since 3 days in).
>>>>> Previously it took 20 days (twice in a row) to get to OOM. But by now
>>>>> it would have grown much larger. The workload hasn't changed.
>>>>>
>>>>> >
>>>>> >>>> I know that older tcmalloc versions have leaks when allocating
>>>>> >>>> larger blocks of memory:
>>>>> >>>> https://code.google.com/p/gperftools/issues/detail?id=368 So it's
>>>>> >>>> possible that there is some kind of allocation pattern in MDS that
>>>>> >>>> causes this behavior or exposes this tcmalloc bug.
>>>>> >>>
>>>>> >>> Hrm, we do use memory pools in the MDS that the OSD and monitor do
>>>>> >>> not, so that could be influencing things.
>>>>> >>
>>>>> >> The issue I linked to is caused generally by making large
>>>>> >> allocations. It's my understanding that prior to the fix there was
>>>>> >> very bad fragmentation with large allocations.
>>>>> >>>
>>>>> >>>> Last time I brought it up there was resistance to tossing tcmalloc,
>>>>> >>>> which is fine. What I'd like to see is not linking against tcmalloc
>>>>> >>>> on systems that are known to have a buggy tcmalloc (in this case
>>>>> >>>> ubuntu 12.04, older Debian systems).
>>>>> >>>
>>>>> >>> The issue is that back when we did the investigation and testing (on
>>>>> >>> older Debian systems) that made us switch to tcmalloc:
>>>>> >>> 1) memory growth without tcmalloc on the OSDs and monitor was so bad
>>>>> >>> as to make them essentially unusable,
>>>>> >>> 2) the MDS also behaved better with it (though I don't remember how
>>>>> >>> much), and
>>>>> >>> 3) tcmalloc supplies some really nice memory analysis tools that I'd
>>>>> >>> like to keep around.
>>>>> >>>
>>>>> >>> So we'd need to do something like find a different allocator that
>>>>> >>> works for all three processes, or link the OSD and monitor with it
>>>>> >>> but not the MDS *and* demonstrate that the default allocators on
>>>>> >>> each of our platforms work for the MDS without issue (or go down the
>>>>> >>> rat's nest of selecting an allocator based on platform). Before we
>>>>> >>> embark on that I'd like to get more data about what's causing the
>>>>> >>> memory growth. Can you gather some heap dumps and stats? Have you
>>>>> >>> tried just instructing the MDS to release unused memory when it
>>>>> >>> passes some threshold?
>>>>> >>
>>>>> >> For another internal project we started off with tcmalloc and
>>>>> >> switched to jemalloc. We ran into the same kind of pattern with
>>>>> >> tcmalloc on ubuntu 12.04.
>>>>> >>
>>>>> >> Now in our case, doing the database equivalent of sorting 10s to low
>>>>> >> 100s of gigabytes in a background process (maintenance jobs for
>>>>> >> compacting and dup removal), we did this in blocks of 0.25 GB using
>>>>> >> merge sort. After about a day of runtime (when a lot of these jobs
>>>>> >> ran) we would start running into OOM cases. I enabled the tcmalloc
>>>>> >> debugger (via flags) and it would log every 1 GB allocated. Tcmalloc
>>>>> >> reported that the app was using low gigabytes of working memory
>>>>> >> during busy times and going into the low 10s of megabytes at idle
>>>>> >> times. Yet despite that, the memory consumed by the process was
>>>>> >> reaching 40 gigs.
>>>>> >
>>>>> > Did you try using the HeapRelease() command (or whatever it's called)?
>>>>> > A few users have reported that tcmalloc was broken in one way or
>>>>> > another on their platform (though usually on something like Gentoo
>>>>> > rather than Ubuntu Precise!) and that call has invariably dealt with
>>>>> > the issue. *shrug*
>>>>>
>>>>> For our use case I did end up playing with the various configuration
>>>>> knobs for tcmalloc (via environment variables). None of them ended up
>>>>> helping (release rate, etc.). We did not end up calling the tcmalloc
>>>>> functions directly (like HeapRelease) because we didn't want to have
>>>>> our app depend on tcmalloc. And, quite frankly, I thought it was silly
>>>>> for us to jump through a lot of hoops in order to make the allocator
>>>>> not explode.
>>>>>
>>>>> >
>>>>> >> We considered building tcmalloc from source, but noticed that redis
>>>>> >> in ubuntu/debian uses jemalloc and switched to using it. In this
>>>>> >> case, yes, I'm shilling for jemalloc because it solved similar issues
>>>>> >> we experienced. And after doing significant testing on performance to
>>>>> >> compare the two, it was within the margin of error. Recent versions
>>>>> >> of jemalloc can output heap profiling information in a format
>>>>> >> understood by pprof (the google perftools).
>>>>> >
>>>>> > Interesting. Next time we wrangle some time to look at these issues
>>>>> > I'll check jemalloc out.
>>>>> > -Greg
>>>>> > Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>
>>>>>
>>>>> --
>>>>> Milosz Tanski
>>>>> CTO
>>>>> 10 East 53rd Street, 37th floor
>>>>> New York, NY 10022
>>>>>
>>>>> p: 646-253-9055
>>>>> e: milosz@xxxxxxxxx
>>>>
>>>>
>>>> --
>>>> Milosz Tanski
>>>> CTO
>>>> 16 East 34th Street, 15th floor
>>>> New York, NY 10016
>>>>
>>>> p: 646-253-9055
>>>> e: milosz@xxxxxxxxx
>>
>>
>> --
>> Milosz Tanski
>> CTO
>> 16 East 34th Street, 15th floor
>> New York, NY 10016
>>
>> p: 646-253-9055
>> e: milosz@xxxxxxxxx

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx
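
A note for readers following the thread: the leak Greg hypothesizes at the top
(standby-replay re-opening the root inode on every replay pass without freeing
the previous one) comes down to an ordinary "reassign the pointer, orphan the
old allocation" pattern. The sketch below is not the actual Ceph MDS code; the
CInode and InodeCache types and the open_root_* functions are made-up stand-ins
that only illustrate the suspected shape of the bug.

#include <string>

// Minimal stand-ins for illustration only; not the real Ceph types.
struct CInode {
    std::string path;
    // ... in the real MDS, dentries, caps, etc. hang off the inode and
    // pin a meaningful amount of memory
};

struct InodeCache {
    CInode* root = nullptr;

    // Suspected bug shape: each (re)boot of the replay sequence allocates a
    // fresh root inode, but the previous one is never freed, so a
    // standby-replay daemon that keeps re-running the boot path grows
    // without bound.
    void open_root_leaky() {
        root = new CInode{"/"};   // the old `root` allocation is orphaned
    }

    // The fix-shaped alternative: release (or reuse) the old inode first.
    void open_root_fixed() {
        delete root;
        root = new CInode{"/"};
    }

    ~InodeCache() { delete root; }
};

int main() {
    InodeCache cache;
    for (int i = 0; i < 3; ++i)
        cache.open_root_leaky();  // leaks two CInode objects by the end
    return 0;
}

If that is what is happening, every replay/boot cycle would pin another copy of
the root inode's state, which would match the slow, standby-only growth Milosz
describes.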
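On Greg's "HeapRelease()" suggestion: when a binary is linked against tcmalloc,
gperftools exposes this through MallocExtension. Below is a minimal sketch of
the "release unused memory past some threshold" idea; the maybe_release_heap
wrapper and the 1 GiB threshold are invented for illustration, and on older
gperftools packages the header is <google/malloc_extension.h> rather than
<gperftools/malloc_extension.h>. Milosz's environment-variable route (for
example TCMALLOC_RELEASE_RATE) tunes the same release machinery without code
changes.

#include <gperftools/malloc_extension.h>
#include <cstddef>

// Ask tcmalloc to hand free spans back to the OS once the heap grows past a
// threshold.  ReleaseFreeMemory() is the call Greg refers to as
// "HeapRelease()"; it only returns pages tcmalloc already considers free.
void maybe_release_heap(size_t threshold_bytes = size_t(1) << 30) {
    size_t heap_size = 0;
    MallocExtension::instance()->GetNumericProperty("generic.heap_size",
                                                    &heap_size);
    if (heap_size > threshold_bytes) {
        MallocExtension::instance()->ReleaseFreeMemory();
    }
}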