Hmm, I just tried several times on my test cluster and I can't get any
core dump. Does Ceph commit suicide or something? Is that the expected
behavior?
--
Regards,
Sébastien Han.

On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
> Hi Loïc,
>
> Thanks for bringing our discussion onto the ML. I'll check that tomorrow :-).
>
> Cheers
> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
>> Hi Loïc,
>>
>> Thanks for bringing our discussion onto the ML. I'll check that tomorrow :-).
>>
>> Cheers
>>
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> As discussed during FOSDEM, the script you wrote to kill the OSD when it
>>> grows too much could be amended to make it core dump instead of just
>>> being killed and restarted. The binary plus the core could probably be
>>> used to figure out where the leak is.
>>>
>>> You should make sure the OSD's current working directory is on a file
>>> system with enough free disk space to accommodate the dump, and set
>>>
>>> ulimit -c unlimited
>>>
>>> before running it (your system default is probably ulimit -c 0, which
>>> inhibits core dumps). When you detect that the OSD has grown too much,
>>> kill it with
>>>
>>> kill -SEGV $pid
>>>
>>> and upload the core found in the working directory, together with the
>>> binary, to a public place. If the osd binary is compiled with -g but
>>> without changing the -O settings, you should get a larger binary file
>>> but no negative impact on performance. Forensic analysis will be made a
>>> lot easier with the debugging symbols.
>>>
>>> My 2cts
>>>
>>> On 01/31/2013 08:57 PM, Sage Weil wrote:
>>> > On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>>> >> Hi,
>>> >>
>>> >> I disabled scrubbing using
>>> >>
>>> >>> ceph osd tell \* injectargs '--osd-scrub-min-interval 1000000'
>>> >>> ceph osd tell \* injectargs '--osd-scrub-max-interval 10000000'
>>> >>
>>> >> and the leak seems to be gone.
>>> >>
>>> >> See the graph at http://i.imgur.com/A0KmVot.png with the memory of
>>> >> the 12 OSD processes over the last 3.5 days.
>>> >> Memory was rising every 24h. I made the change yesterday around 13h00
>>> >> and the OSDs stopped growing. OSD memory even seems to go down slowly
>>> >> in small steps.
>>> >>
>>> >> Of course I assume disabling scrubbing is not a long-term solution and
>>> >> I should re-enable it ... (how do I do that, btw? What were the
>>> >> default values for those parameters?)
>>> >
>>> > It depends on the exact commit you're on. You can see the defaults if
>>> > you do
>>> >
>>> > ceph-osd --show-config | grep osd_scrub
>>> >
>>> > Thanks for testing this... I have a few other ideas to try to reproduce.
>>> >
>>> > sage
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
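
[Editor's note] Loïc's suggestion above (let the watchdog script send SIGSEGV
instead of killing and restarting the OSD) could be sketched roughly as the
snippet below. The 4 GB threshold and the way the pid is obtained are
illustrative assumptions, not details taken from the thread.

```shell
#!/bin/sh
# Hedged sketch: instead of killing & restarting a leaking ceph-osd,
# send it SIGSEGV so it leaves a core dump behind for analysis.
# THRESHOLD_KB is an assumption; pick a value that fits your hosts.

THRESHOLD_KB=4194304  # dump once the resident set exceeds ~4 GB

# Resident set size (kB) of a pid, read from /proc.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

dump_if_too_big() {
    pid="$1"
    rss=$(rss_kb "$pid") || return 1
    if [ "${rss:-0}" -gt "$THRESHOLD_KB" ]; then
        # SIGSEGV produces a core file, provided the OSD was started
        # under 'ulimit -c unlimited' and its working directory has
        # enough free space for the dump, as Loic describes above.
        kill -SEGV "$pid"
    fi
}
```

The watchdog would then call `dump_if_too_big $osd_pid` from its existing
polling loop in place of the plain kill-and-restart step.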
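
[Editor's note] Following Sage's answer, re-enabling scrubbing amounts to
reading the build's compiled-in defaults back and injecting them into the
running OSDs. The numeric values below are assumptions (one day / one week,
the commonly cited defaults of that era); since the defaults depend on the
exact commit, use whatever `--show-config` reports for your build.

```shell
# Ask the build for its compiled-in scrub defaults (they vary by commit):
ceph-osd --show-config | grep osd_scrub

# Push the defaults back into the running OSDs. 86400/604800 are assumed
# values; substitute the numbers the command above prints for your build.
ceph osd tell \* injectargs '--osd-scrub-min-interval 86400'
ceph osd tell \* injectargs '--osd-scrub-max-interval 604800'
```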