Re: ceph-osd leak

On Fri, Sep 20, 2013 at 11:44 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
[ Re-added the list — please keep emails on there so everybody can benefit! ]

On Fri, Sep 20, 2013 at 12:24 PM, Serge Slipchenko
<serge.slipchenko@xxxxxxxxx> wrote:
>
>
>
> On Fri, Sep 20, 2013 at 5:59 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Fri, Sep 20, 2013 at 6:40 AM, Serge Slipchenko
>> <serge.slipchenko@xxxxxxxxx> wrote:
>> > Hi,
>> >
>> > I'm using CephFS 0.67.3 as a backend for Hypertable and ElasticSearch.
>> > Active reading/writing to the CephFS causes uncontrolled OSD memory growth,
>> > which at the final stage leads to swapping and server unavailability.
>>
>> What kind of memory growth are you seeing?
>
> 10-20 GB
>
>>
>> > To keep the cluster in working condition I have to restart the OSDs with
>> > excessive memory consumption.
>> > This is definitely wrong, but I hope it will help to understand the problem.
>> >
>> > Below, one of the nodes finishes scrubbing, then a series of faults from the MON
>> > follows, and the OSD is restarted by the memory guard script.
>>
>> What makes you think a monitor is involved? The log below doesn't look
>> like a monitor unless you've done something strange with your config
>> (wrong ports).
>
> Yes, I was somewhat imprecise. I meant that 144.76.13.103 is also a monitor node.
>
>>
>> > 2013-09-20 10:54:39.901871 7f74374a0700  0 log [INF] : 5.e0 scrub ok
>> > 2013-09-20 10:56:50.563862 7f74374a0700  0 log [INF] : 1.27 scrub ok
>> > 2013-09-20 11:00:03.159553 7f742c826700  0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=0 pgs=0 cs=0 l=0 c=0x9889000).accept connect_seq 2 vs existing 1 state standby
>> > 2013-09-20 11:00:04.935305 7f7433685700  0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/1771 pipe(0x963b000 sd=63 :56878 s=2 pgs=41599 cs=553 l=0 c=0x9679160).fault with nothing to send, going to standby
>> > 2013-09-20 11:00:04.986654 7f742c725700  0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6803/1771 pipe(0x9859780 sd=240 :0 s=1 pgs=0 cs=0 l=1 c=0xb2b1b00).fault
>> > 2013-09-20 11:00:04.986662 7f7430157700  0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6802/1771 pipe(0xbbf4780 sd=144 :0 s=1 pgs=0 cs=0 l=1 c=0xa89b000).fault
>> > 2013-09-20 11:03:23.499091 7f7432379700  0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b6e0).accept connect_seq 46 vs existing 0 state connecting
>> > 2013-09-20 11:03:23.499704 7f7432379700  0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=1 pgs=2107 cs=47 l=0 c=0xf247580).fault
>> > 2013-09-20 11:03:23.505559 7f7431369700  0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0x9874c80 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b000).accept connect_seq 1 vs existing 47 state connecting
>> > 2013-09-20 11:15:03.239657 7f742c826700  0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=2 pgs=1297 cs=3 l=0 c=0x9855b00).fault with nothing to send, going to standby
>> >
>> > A similar chain of events is repeated on different servers roughly every 2 hours.
>> >
>> > It looks similar to the old bug http://tracker.ceph.com/issues/3883,
>> > but I'm using plain log files.
>>
>> Not if your issue is correlated with writes rather than scrubs. :)
>
> Could those problems be caused by a slow network?
>
>>
>> > Is it anything well known or something new?
>>
>> Nobody's reported anything like it yet.
>> In addition to the above, we'll also need to know about your cluster.
>> How many nodes, what does each look like, what does your network look
>> like, what OS and where did you get your Ceph packages?
>
> I have 8 servers connected via a 1 Gb/s network, but for some servers the
> actual speed is only 100-200 Mb/s.

Well, yeah, that'll do it. 200 Mb/s is only ~25 MB/s, which is much
slower than your servers can write to disk. So the machines with
faster network connections are ingesting data and putting it on disk
much more quickly than they can replicate it to the servers with
slower links, and the replication messages are just getting queued up
in RAM. Ceph is designed so you can make it work with asymmetric
hardware, but making it work well with an asymmetric network is going
to be more challenging.
Yes, it looks like the servers with 800 Mb/s or faster connections never have memory problems.
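
A quick way to check whether it really is queued message data (rather than some other leak) is the OSD admin socket; a minimal sketch, assuming the default socket path and using osd.2 purely as an example:

    # dump the perf counters of a running OSD; the throttle-* sections show how many
    # bytes/messages are currently held by the various message throttlers
    ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok perf dump

    # if the OSD is built against tcmalloc, this gives a rough view of its heap usage
    ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok heap stats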
 
You can play around with a couple different things to try and make this better:
1) Make the weight of the nodes proportional to their bandwidth.
Am I correct that lower weight means less I/O impact?
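
(Lower CRUSH weight does mean the OSD is mapped to proportionally fewer placement groups, so it receives less data and less I/O.) A minimal sketch of what adjusting it could look like; osd.3 and the weight 0.5 are purely illustrative, pick values that roughly match your bandwidth ratios:

    # inspect the current CRUSH tree and weights
    ceph osd tree

    # give an OSD on a slow-network host a smaller CRUSH weight so that it is
    # responsible for proportionally fewer PGs (expect some data movement afterwards)
    ceph osd crush reweight osd.3 0.5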
 
2) Play around with the message throttlers, especially for the
clients. The aggregate amount of in-progress data the servers will
allow in from clients is bounded by this value (multiplied by number
of servers, etc).
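
The knobs in question are the per-OSD client message throttlers. A minimal ceph.conf sketch with purely illustrative values (check the defaults for your release before changing anything, and note that the OSDs need a restart to pick these up from the config file):

    [osd]
        ; maximum number of client messages an OSD will hold in flight at once
        osd client message cap = 100
        ; maximum total size, in bytes, of in-flight client messages (here 250 MB)
        osd client message size cap = 262144000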
-Greg

--
Kind regards, Serge Slipchenko
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
