Re: [0.48.3] OSD memory leak when scrubbing

Sage Weil <sage@xxxxxxxxxxx> · Mon, 4 Feb 2013 11:29:53 -0800 (PST)

On Mon, 4 Feb 2013, Sébastien Han wrote:
> Hum just tried several times on my test cluster and I can't get any
> core dump. Does Ceph commit suicide or something? Is it expected
> behavior?

SIGSEGV should trigger the usual path that dumps a stack trace and then 
dumps core.  Was your ulimit -c set before the daemon was started?
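
One way to check this on a daemon that is already running (a rough sketch, 
Linux only): the limit a process actually runs with is recorded in 
/proc/<pid>/limits, so you can read it back from there. soft_core_limit 
below is just a hypothetical helper name, not something shipped with Ceph.

```shell
# Sketch (Linux only): print the soft core-file limit recorded in a
# /proc/<pid>/limits file. soft_core_limit is a hypothetical helper name.
soft_core_limit() {
    # The row looks like: "Max core file size   0   unlimited   bytes";
    # field 5 is the soft limit, field 6 is the hard limit.
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Example: inspect the current shell. For a daemon, substitute the
# ceph-osd pid for $$.
if [ -r "/proc/$$/limits" ]; then
    soft_core_limit "/proc/$$/limits"
fi
```

If that prints 0, the process will not write a core no matter what ulimit 
says in the shell you are typing in now; the limit has to be raised in the 
environment that starts the daemon.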

sage

> --
> Regards,
> Sébastien Han.
> 
> 
> On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
> > Hi Loïc,
> >
> > Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
> >
> > Cheers
> > --
> > Regards,
> > Sébastien Han.
> >
> >
> > On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
> >> Hi Loïc,
> >>
> >> Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
> >>
> >> Cheers
> >>
> >> --
> >> Regards,
> >> Sébastien Han.
> >>
> >>
> >> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> >>>
> >>> Hi,
> >>>
> >>> As discussed during FOSDEM, the script you wrote to kill the OSD when it
> >>> grows too much could be amended to core dump instead of just being killed &
> >>> restarted. The binary + core could probably be used to figure out where the
> >>> leak is.
> >>>
> >>> You should make sure the OSD's current working directory is on a file
> >>> system with enough free disk space to accommodate the dump, and set
> >>>
> >>> ulimit -c unlimited
> >>>
> >>> before running it (your system default is probably ulimit -c 0, which
> >>> inhibits core dumps). When you detect that the OSD has grown too much,
> >>> kill it with
> >>>
> >>> kill -SEGV $pid
> >>>
> >>> and upload the core found in the working directory, together with the
> >>> binary, to a public place. If the osd binary is compiled with -g but without
> >>> changing the -O settings, you should have a larger binary file but no
> >>> negative impact on performance. Forensic analysis will be made a lot
> >>> easier with the debugging symbols.
> >>>
> >>> My 2cts
> >>>
> >>> On 01/31/2013 08:57 PM, Sage Weil wrote:
> >>> > On Thu, 31 Jan 2013, Sylvain Munaut wrote:
> >>> >> Hi,
> >>> >>
> >>> >> I disabled scrubbing using
> >>> >>
> >>> >>> ceph osd tell \* injectargs '--osd-scrub-min-interval 1000000'
> >>> >>> ceph osd tell \* injectargs '--osd-scrub-max-interval 10000000'
> >>> >>
> >>> >> and the leak seems to be gone.
> >>> >>
> >>> >> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
> >>> >> for the 12 osd processes over the last 3.5 days.
> >>> >> Memory was rising every 24h. I did the change yesterday around 13h00
> >>> >> and OSDs stopped growing. OSD memory even seems to go down slowly by
> >>> >> small blocks.
> >>> >>
> >>> >> Of course I assume disabling scrubbing is not a long term solution and
> >>> >> I should re-enable it ... (how do I do that btw ? what were the
> >>> >> default values for those parameters)
> >>> >
> >>> > It depends on the exact commit you're on.  You can see the defaults if
> >>> > you do
> >>> >
> >>> >  ceph-osd --show-config | grep osd_scrub
> >>> >
> >>> > Thanks for testing this... I have a few other ideas to try to reproduce.
> >>> >
> >>> > sage
> >>> > --
> >>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>> --
> >>> Loïc Dachary, Artisan Logiciel Libre
> >>>
> >>
> 
> 