On Sun, Sep 22, 2013 at 10:00 AM, Serge Slipchenko
<serge.slipchenko@xxxxxxxxx> wrote:
> On Fri, Sep 20, 2013 at 11:44 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> [ Re-added the list — please keep emails on there so everybody can
>> benefit! ]
>>
>> On Fri, Sep 20, 2013 at 12:24 PM, Serge Slipchenko
>> <serge.slipchenko@xxxxxxxxx> wrote:
>> >
>> > On Fri, Sep 20, 2013 at 5:59 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>> >>
>> >> On Fri, Sep 20, 2013 at 6:40 AM, Serge Slipchenko
>> >> <serge.slipchenko@xxxxxxxxx> wrote:
>> >> > Hi,
>> >> >
>> >> > I'm using CephFS 0.67.3 as a backend for Hypertable and ElasticSearch.
>> >> > Active reading/writing to CephFS causes uncontrolled OSD memory
>> >> > growth and, at the final stage, swapping and server unavailability.
>> >>
>> >> What kind of memory growth are you seeing?
>> >
>> > 10-20 GB
>> >
>> >> > To keep the cluster in working condition I have to restart OSDs with
>> >> > excessive memory consumption. This is definitely not a proper fix,
>> >> > but I hope it will help to understand the problem.
>> >> >
>> >> > One of the nodes was scrubbing, then a series of faults from the MON
>> >> > followed, and the OSD was restarted by the memory guard script.
>> >>
>> >> What makes you think a monitor is involved? The log below doesn't look
>> >> like a monitor unless you've done something strange with your config
>> >> (wrong ports).
>> >
>> > Yes, I was somewhat inaccurate. I meant that 144.76.13.103 is also a
>> > monitor node.
>> >
>> >> > 2013-09-20 10:54:39.901871 7f74374a0700 0 log [INF] : 5.e0 scrub ok
>> >> > 2013-09-20 10:56:50.563862 7f74374a0700 0 log [INF] : 1.27 scrub ok
>> >> > 2013-09-20 11:00:03.159553 7f742c826700 0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=0 pgs=0 cs=0 l=0 c=0x9889000).accept connect_seq 2 vs existing 1 state standby
>> >> > 2013-09-20 11:00:04.935305 7f7433685700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/1771 pipe(0x963b000 sd=63 :56878 s=2 pgs=41599 cs=553 l=0 c=0x9679160).fault with nothing to send, going to standby
>> >> > 2013-09-20 11:00:04.986654 7f742c725700 0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6803/1771 pipe(0x9859780 sd=240 :0 s=1 pgs=0 cs=0 l=1 c=0xb2b1b00).fault
>> >> > 2013-09-20 11:00:04.986662 7f7430157700 0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6802/1771 pipe(0xbbf4780 sd=144 :0 s=1 pgs=0 cs=0 l=1 c=0xa89b000).fault
>> >> > 2013-09-20 11:03:23.499091 7f7432379700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b6e0).accept connect_seq 46 vs existing 0 state connecting
>> >> > 2013-09-20 11:03:23.499704 7f7432379700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=1 pgs=2107 cs=47 l=0 c=0xf247580).fault
>> >> > 2013-09-20 11:03:23.505559 7f7431369700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0x9874c80 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b000).accept connect_seq 1 vs existing 47 state connecting
>> >> > 2013-09-20 11:15:03.239657 7f742c826700 0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=2 pgs=1297 cs=3 l=0 c=0x9855b00).fault with nothing to send, going to standby
>> >> >
>> >> > A similar chain of events is repeated on different servers with a
>> >> > regularity of about 2 hours.
>> >> >
>> >> > It looks similar to the old bug http://tracker.ceph.com/issues/3883,
>> >> > but I'm using plain log files.
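(A purely illustrative aside, not Serge's actual script: a "memory guard" of
the kind described above, i.e. something that restarts any ceph-osd whose
resident memory gets out of hand, might look roughly like the sketch below.
It assumes the Python psutil module, a sysvinit-style "service ceph restart
osd.N" restart command, and an arbitrary 8 GiB threshold.)

#!/usr/bin/env python
# Hypothetical memory-guard sketch: restart any ceph-osd process whose
# resident set size exceeds a threshold. Run it periodically, e.g. from cron.
import subprocess
import psutil

RSS_LIMIT = 8 * 1024 ** 3  # 8 GiB; pick whatever your nodes can tolerate

for proc in psutil.process_iter():
    try:
        if proc.name() != 'ceph-osd':
            continue
        if proc.memory_info().rss < RSS_LIMIT:
            continue
        # ceph-osd is started as "ceph-osd -i <id> ...", so pull the OSD id
        # out of the command line.
        args = proc.cmdline()
        osd_id = args[args.index('-i') + 1]
        # The restart command depends on your init system; sysvinit shown.
        subprocess.call(['service', 'ceph', 'restart', 'osd.%s' % osd_id])
    except (psutil.Error, ValueError, IndexError):
        continue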
>> >>
>> >> Not if your issue is correlated with writes rather than scrubs. :)
>> >
>> > Could those problems be caused by a slow network?
>> >
>> >> > Is it anything well known or something new?
>> >>
>> >> Nobody's reported anything like it yet.
>> >> In addition to the above, we'll also need to know about your cluster.
>> >> How many nodes, what does each look like, what's your network look
>> >> like, what OS and where did you get your Ceph packages?
>> >
>> > I have 8 servers connected via a 1Gb network, but for some servers the
>> > actual speed is 100-200Mb.
>>
>> Well, yeah, that'll do it. 200Mb/s is only ~25MB/s, which is much slower
>> than your servers can write to disk. So your machines with faster network
>> are ingesting data and putting it on disk much more quickly than they can
>> replicate it to the servers with slower network connections, and the
>> replication messages are just getting queued up in RAM. Ceph is designed
>> so you can make it work with async hardware, but making it work well with
>> an async network is going to be more challenging.
>
> Yes, it looks like servers that have 800Mb and higher connections never
> have memory problems.
>
>> You can play around with a couple different things to try and make this
>> better:
>> 1) Make the weight of the nodes proportional to their bandwidth.
>
> Am I correct that lower weight means less I/O impact?

Yep! The weight controls how much of the cluster's data is stored on the
OSD, which is directly proportional to how much IO it gets.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
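(A rough illustration of suggestion 1) above, not commands taken from this
thread: CRUSH weights are set per OSD with "ceph osd crush reweight", so one
way to bias placement toward the faster nodes is to derive each host's weight
from its usable bandwidth and apply it to that host's OSDs. The host names,
OSD ids, and link speeds below are made up.)

#!/usr/bin/env python
# Illustrative only: print "ceph osd crush reweight" commands that scale
# each host's OSD weights by its measured network bandwidth.

# Hypothetical usable bandwidth per host, in Mbit/s.
bandwidth_mbit = {'node1': 900, 'node2': 850, 'node3': 200, 'node4': 150}

# Hypothetical mapping of hosts to the OSD ids they carry.
osds_on_host = {'node1': [0, 1], 'node2': [2, 3],
                'node3': [4, 5], 'node4': [6, 7]}

FULL_SPEED = 1000.0  # Mbit/s; a full-speed host keeps weight 1.0

for host, mbit in sorted(bandwidth_mbit.items()):
    weight = round(mbit / FULL_SPEED, 2)
    for osd in osds_on_host[host]:
        # Print the commands for review rather than executing them blindly.
        print('ceph osd crush reweight osd.%d %.2f' % (osd, weight))

Since the CRUSH weight controls how much of the cluster's data (and therefore
how much IO) lands on an OSD, the 100-200Mb hosts end up with proportionally
less replication traffic to absorb.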