On Fri, Sep 20, 2013 at 6:40 AM, Serge Slipchenko <serge.slipchenko@xxxxxxxxx> wrote:
> Hi,
>
> I'm using CephFS 0.67.3 as a backend for Hypertable and ElasticSearch.
> Active reading/writing to the cephfs causes uncontrolled OSD memory
> growth, and at the final stage swapping and server unavailability.

What kind of memory growth are you seeing?

> To keep the cluster in working condition I have to restart OSDs with
> excessive memory consumption.
> This is definitely wrong, but I hope it will help to understand the
> problem.
>
> On one of the nodes, scrubbing is followed by a series of faults from
> the MON, and the OSD is restarted by the memory guard script.

What makes you think a monitor is involved? The log below doesn't look
like a monitor unless you've done something strange with your config
(wrong ports).

> 2013-09-20 10:54:39.901871 7f74374a0700 0 log [INF] : 5.e0 scrub ok
> 2013-09-20 10:56:50.563862 7f74374a0700 0 log [INF] : 1.27 scrub ok
> 2013-09-20 11:00:03.159553 7f742c826700 0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=0 pgs=0 cs=0 l=0 c=0x9889000).accept connect_seq 2 vs existing 1 state standby
> 2013-09-20 11:00:04.935305 7f7433685700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/1771 pipe(0x963b000 sd=63 :56878 s=2 pgs=41599 cs=553 l=0 c=0x9679160).fault with nothing to send, going to standby
> 2013-09-20 11:00:04.986654 7f742c725700 0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6803/1771 pipe(0x9859780 sd=240 :0 s=1 pgs=0 cs=0 l=1 c=0xb2b1b00).fault
> 2013-09-20 11:00:04.986662 7f7430157700 0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6802/1771 pipe(0xbbf4780 sd=144 :0 s=1 pgs=0 cs=0 l=1 c=0xa89b000).fault
> 2013-09-20 11:03:23.499091 7f7432379700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b6e0).accept connect_seq 46 vs existing 0 state connecting
> 2013-09-20 11:03:23.499704 7f7432379700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=1 pgs=2107 cs=47 l=0 c=0xf247580).fault
> 2013-09-20 11:03:23.505559 7f7431369700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0x9874c80 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b000).accept connect_seq 1 vs existing 47 state connecting
> 2013-09-20 11:15:03.239657 7f742c826700 0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=2 pgs=1297 cs=3 l=0 c=0x9855b00).fault with nothing to send, going to standby
>
> A similar chain of events repeats on different servers, roughly every
> two hours.
>
> It looks similar to the old bug http://tracker.ceph.com/issues/3883,
> but I'm using plain log files.

Not if your issue is correlated with writes rather than scrubs. :)

> Is this anything well known, or something new?

Nobody's reported anything like it yet. In addition to the above, we'll
also need to know about your cluster: how many nodes, what does each
look like, what does your network look like, what OS, and where did you
get your Ceph packages?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
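
A note on Greg's first question ("what kind of memory growth"): one way to
characterize it is to sample the resident set size of each ceph-osd process
over time and correlate the samples with scrub and client-write activity in
the logs. Below is a minimal sketch of such a sampler, assuming Linux (so
/proc is available) and daemon processes named ceph-osd; the output path and
sampling interval are illustrative choices, not anything taken from the
thread.

#!/usr/bin/env python
# Sketch: periodically record the resident set size (VmRSS) of every
# ceph-osd process so that "uncontrolled OSD memory growth" can be
# quantified over time. Assumes Linux /proc; paths below are illustrative.

import os
import time

SAMPLE_INTERVAL = 60                  # seconds between samples (illustrative)
GROWTH_LOG = "/var/log/osd-rss.log"   # hypothetical output path


def find_osd_pids():
    """Return the PIDs of all running ceph-osd processes."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == "ceph-osd":
                    pids.append(int(entry))
        except IOError:
            # Process exited between listdir() and open(); skip it.
            continue
    return pids


def rss_kb(pid):
    """Read VmRSS (in kB) for a PID from /proc/<pid>/status."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


def main():
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(GROWTH_LOG, "a") as out:
            for pid in find_osd_pids():
                try:
                    out.write("%s pid=%d rss_kb=%d\n" % (stamp, pid, rss_kb(pid)))
                except IOError:
                    continue
        time.sleep(SAMPLE_INTERVAL)


if __name__ == "__main__":
    main()

If the OSDs are built with tcmalloc, its heap profiler can break that growth
down further, but plain RSS samples are usually enough to show whether the
growth tracks scrubbing or the Hypertable/ElasticSearch write load, which is
the distinction Greg draws when ruling out issue 3883.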