Re: [0.48.3] OSD memory leak when scrubbing

Andrey Korolyov <andrey@xxxxxxx> · Sat, 16 Feb 2013 10:09:08 +0300

Can anyone who hit this bug please confirm that your system contains libc 2.15+?
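
In case it helps anyone answer quickly, here is a minimal sketch for checking the version, assuming a glibc-based system where "getconf GNU_LIBC_VERSION" is available. The helper name version_ge is only illustrative and not part of any tool.

```shell
#!/bin/sh
# Succeeds if version $1 is >= version $2, comparing "major.minor" only.
version_ge() {
    a_major=${1%%.*}; a_minor=${1#*.}; a_minor=${a_minor%%.*}
    b_major=${2%%.*}; b_minor=${2#*.}; b_minor=${b_minor%%.*}
    [ "$a_major" -gt "$b_major" ] ||
        { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

# On glibc systems getconf prints something like "glibc <version>";
# keep only the second field. Empty if getconf or glibc is missing.
libc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}') || libc_version=""

if [ -z "$libc_version" ]; then
    echo "could not determine glibc version (non-glibc system?)"
elif version_ge "$libc_version" 2.15; then
    echo "glibc $libc_version: 2.15 or newer"
else
    echo "glibc $libc_version: older than 2.15"
fi
```

Running "ldd --version" should report the same version on most distributions, if getconf is not at hand.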

On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
> oh nice, the pattern also matches path :D, didn't know that
> thanks Greg
> --
> Regards,
> Sébastien Han.
>
>
> On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>> Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core
>> -Greg
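
A rough sketch of what that could look like, assuming the %e, %p and %t specifiers described in core(5). The helper name core_pattern_for and the directory /var/crash are only examples, so pick a file system with enough free space.

```shell
#!/bin/sh
# Build a core_pattern value that places core files in the given directory.
# %e = executable name, %p = PID, %t = time of dump (seconds since the Epoch).
core_pattern_for() {
    # Strip one trailing slash so "/var/crash/" and "/var/crash" match.
    printf '%s/core.%%e.%%p.%%t\n' "${1%/}"
}

# Usage, as root (the directory is a hypothetical choice):
#   mkdir -p /var/crash
#   core_pattern_for /var/crash > /proc/sys/kernel/core_pattern
#   cat /proc/sys/kernel/core_pattern
```

The setting is system-wide and does not survive a reboot unless it is also put in sysctl configuration (kernel.core_pattern).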
>>
>> On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
>>> OK, I finally managed to get something on my test cluster;
>>> unfortunately, the dump goes to /
>>>
>>> Any idea how to change the destination path?
>>>
>>> My production / won't be big enough...
>>>
>>> --
>>> Regards,
>>> Sébastien Han.
>>>
>>>
>>> On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick <dan.mick@xxxxxxxxxxx> wrote:
>>>> ...and/or do you have the corepath set interestingly, or one of the
>>>> core-trapping mechanisms turned on?
>>>>
>>>>
>>>> On 02/04/2013 11:29 AM, Sage Weil wrote:
>>>>>
>>>>> On Mon, 4 Feb 2013, Sébastien Han wrote:
>>>>>>
>>>>>> Hmm, I just tried several times on my test cluster and I can't get any
>>>>>> core dump. Does Ceph commit suicide or something? Is it expected
>>>>>> behavior?
>>>>>
>>>>>
>>>>> SIGSEGV should trigger the usual path that dumps a stack trace and then
>>>>> dumps core.  Was your ulimit -c set before the daemon was started?
>>>>>
>>>>> sage
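
One way to answer that for a daemon that is already running is to read its limits from /proc. This is only a sketch, assuming a Linux /proc/<pid>/limits file in the usual column layout. The helper name core_soft_limit is made up for illustration.

```shell
#!/bin/sh
# Print the soft "Max core file size" limit from a /proc/<pid>/limits file.
# Columns: Max(1) core(2) file(3) size(4) soft(5) hard(6) units(7).
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Usage (pidof -s returns a single PID; with several OSDs per host,
# check each PID separately):
#   core_soft_limit "/proc/$(pidof -s ceph-osd)/limits"
# A result of 0 means the process cannot dump core.
```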
>>>>>
>>>>>
>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Sébastien Han.
>>>>>>
>>>>>>
>>>>>> On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hi Loïc,
>>>>>>>
>>>>>>> Thanks for bringing our discussion to the ML. I'll check that tomorrow
>>>>>>> :-).
>>>>>>>
>>>>>>> Cheers
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Sébastien Han.
>>>>>>>
>>>>>>>
>>>>>>>> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> As discussed during FOSDEM, the script you wrote to kill the OSD when
>>>>>>>>> it grows too much could be amended to make the OSD dump core instead
>>>>>>>>> of just being killed & restarted. The binary + core could probably be
>>>>>>>>> used to figure out where the leak is.
>>>>>>>>>
>>>>>>>>> You should make sure the OSD's current working directory is on a file
>>>>>>>>> system with enough free disk space to accommodate the dump and set
>>>>>>>>>
>>>>>>>>> ulimit -c unlimited
>>>>>>>>>
>>>>>>>>> before running it (your system default is probably ulimit -c 0, which
>>>>>>>>> inhibits core dumps). When you detect that the OSD has grown too much,
>>>>>>>>> kill it with
>>>>>>>>>
>>>>>>>>> kill -SEGV $pid
>>>>>>>>>
>>>>>>>>> and upload the core found in the working directory, together with the
>>>>>>>>> binary, to a public place. If the osd binary is compiled with -g but
>>>>>>>>> without changing the -O settings, you should have a larger binary file
>>>>>>>>> but no negative impact on performance. Forensic analysis will be made a
>>>>>>>>> lot easier with the debugging symbols.
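
The procedure above could be sketched roughly as follows. This is not the script discussed at FOSDEM, just an illustration assuming Linux /proc/<pid>/status. The function names and the threshold are hypothetical.

```shell
#!/bin/sh
# Print the resident set size (in kB) of a PID, read from /proc/<pid>/status.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# dump_if_too_big PID LIMIT_KB
# Send SIGSEGV to PID when its RSS exceeds LIMIT_KB, so that it dumps core
# instead of simply being killed. Does nothing when under the limit.
dump_if_too_big() {
    rss=$(rss_kb "$1") || return 1
    if [ -n "$rss" ] && [ "$rss" -gt "$2" ]; then
        echo "pid $1 uses ${rss} kB (> $2 kB), sending SIGSEGV"
        kill -SEGV "$1"
    fi
}

# Usage (the 4 GB threshold is an arbitrary example):
#   ulimit -c unlimited        # must be in effect before the OSD starts
#   dump_if_too_big "$osd_pid" 4194304
```

The core ends up in the daemon's working directory unless core_pattern says otherwise, so the free-space check still applies.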
>>>>>>>>>
>>>>>>>>> My 2cts
>>>>>>>>>
>>>>>>>>> On 01/31/2013 08:57 PM, Sage Weil wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I disabled scrubbing using
>>>>>>>>>>>
>>>>>>>>>>>> ceph osd tell \* injectargs '--osd-scrub-min-interval 1000000'
>>>>>>>>>>>> ceph osd tell \* injectargs '--osd-scrub-max-interval 10000000'
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> and the leak seems to be gone.
>>>>>>>>>>>
>>>>>>>>>>> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
>>>>>>>>>>> for the 12 osd processes over the last 3.5 days.
>>>>>>>>>>> Memory was rising every 24h. I did the change yesterday around 13h00
>>>>>>>>>>> and OSDs stopped growing. OSD memory even seems to go down slowly by
>>>>>>>>>>> small blocks.
>>>>>>>>>>>
>>>>>>>>>>> Of course I assume disabling scrubbing is not a long term solution and
>>>>>>>>>>> I should re-enable it ... (how do I do that, btw? What were the
>>>>>>>>>>> default values for those parameters?)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It depends on the exact commit you're on.  You can see the defaults if
>>>>>>>>>> you do
>>>>>>>>>>
>>>>>>>>>>   ceph-osd --show-config | grep osd_scrub
>>>>>>>>>>
>>>>>>>>>> Thanks for testing this... I have a few other ideas to try to reproduce.
>>>>>>>>>>
>>>>>>>>>> sage
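
To turn those defaults back into injectargs commands, something like the sketch below might do. It assumes the --show-config output consists of "name = value" lines, which should be checked against your version. The function name scrub_restore_commands is made up, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# Read "name = value" lines on stdin and print (without executing) the
# injectargs commands for the two scrub interval options.
scrub_restore_commands() {
    awk -F' *= *' '
        $1 == "osd_scrub_min_interval" || $1 == "osd_scrub_max_interval" {
            opt = $1
            gsub(/_/, "-", opt)   # option names use dashes on the command line
            printf "ceph osd tell \\* injectargs '\''--%s %s'\''\n", opt, $2
        }'
}

# Usage: review the printed commands, then run them by hand.
#   ceph-osd --show-config | scrub_restore_commands
```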
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>