Hi,
I'm using CephFS 0.67.3 as a backend for Hypertable and ElasticSearch. Active reading from and writing to CephFS causes uncontrolled OSD memory growth, which eventually leads to swapping and server unavailability.
To keep the cluster in working condition, I have to restart the OSDs with excessive memory consumption.
This is definitely wrong, but I hope it helps to understand the problem.
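For context, the memory guard is nothing sophisticated; it does roughly what the sketch below shows (a minimal sketch rather than my exact script: the 4 GiB RSS threshold, the way the OSD id is read from the command line, and the sysvinit "service ceph restart" call are illustrative assumptions):

#!/usr/bin/env python
# Minimal sketch of a memory-guard script: scan /proc for ceph-osd
# processes and restart any whose RSS exceeds a threshold.
# The 4 GiB limit and the sysvinit "service ceph restart" call are
# assumptions -- adjust for your own setup.
import os
import re
import subprocess

RSS_LIMIT_KB = 4 * 1024 * 1024  # 4 GiB, hypothetical threshold


def osd_processes():
    """Yield (pid, osd_id, rss_kb) for every running ceph-osd."""
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/cmdline' % pid) as f:
                cmdline = f.read().split('\0')
            if not cmdline or 'ceph-osd' not in cmdline[0]:
                continue
            # ceph-osd is normally started with "-i <id>"
            osd_id = cmdline[cmdline.index('-i') + 1]
            with open('/proc/%s/status' % pid) as f:
                match = re.search(r'VmRSS:\s+(\d+) kB', f.read())
            if match:
                yield pid, osd_id, int(match.group(1))
        except (IOError, ValueError, IndexError):
            continue  # process exited or had an unexpected cmdline


def main():
    for pid, osd_id, rss_kb in osd_processes():
        if rss_kb > RSS_LIMIT_KB:
            print('osd.%s (pid %s) uses %d kB RSS, restarting'
                  % (osd_id, pid, rss_kb))
            subprocess.call(['service', 'ceph', 'restart', 'osd.%s' % osd_id])


if __name__ == '__main__':
    main()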
On one of the nodes, scrubbing completes, a series of faults follows, and the OSD is then restarted by the memory-guard script:
2013-09-20 10:54:39.901871 7f74374a0700 0 log [INF] : 5.e0 scrub ok
2013-09-20 10:56:50.563862 7f74374a0700 0 log [INF] : 1.27 scrub ok
2013-09-20 11:00:03.159553 7f742c826700 0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=0 pgs=0 cs=0 l=0 c=0x9889000).accept connect_seq 2 vs existing 1 state standby
2013-09-20 11:00:04.935305 7f7433685700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/1771 pipe(0x963b000 sd=63 :56878 s=2 pgs=41599 cs=553 l=0 c=0x9679160).fault with nothing to send, going to standby
2013-09-20 11:00:04.986654 7f742c725700 0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6803/1771 pipe(0x9859780 sd=240 :0 s=1 pgs=0 cs=0 l=1 c=0xb2b1b00).fault
2013-09-20 11:00:04.986662 7f7430157700 0 -- 5.9.136.227:0/1389 >> 144.76.13.103:6802/1771 pipe(0xbbf4780 sd=144 :0 s=1 pgs=0 cs=0 l=1 c=0xa89b000).fault
2013-09-20 11:03:23.499091 7f7432379700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b6e0).accept connect_seq 46 vs existing 0 state connecting
2013-09-20 11:03:23.499704 7f7432379700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0xb2d0500 sd=230 :6801 s=1 pgs=2107 cs=47 l=0 c=0xf247580).fault
2013-09-20 11:03:23.505559 7f7431369700 0 -- 5.9.136.227:6801/1389 >> 144.76.13.103:6801/17989 pipe(0x9874c80 sd=230 :6801 s=0 pgs=0 cs=0 l=0 c=0xa89b000).accept connect_seq 1 vs existing 47 state connecting
2013-09-20 11:15:03.239657 7f742c826700 0 -- 5.9.136.227:6801/1389 >> 5.9.136.227:6805/1510 pipe(0x97fcc80 sd=72 :6801 s=2 pgs=1297 cs=3 l=0 c=0x9855b00).fault with nothing to send, going to standby
A similar chain of events repeats on different servers roughly every two hours.
It looks similar to the old bug http://tracker.ceph.com/issues/3883, but I'm using plain log files.
Is this something well known, or something new?
Kind regards, Serge Slipchenko