kernel: libceph socket closed (con state OPEN)

Hello everyone!

I have been doing some log analysis on my systems here, trying to detect problems before they affect my users. 

One thing I have found is that I am seeing a lot of messages like these:
Jun 10 06:47:09 10.3.1.1 kernel: [2960203.682638] libceph: osd2 10.3.1.2:6800 socket closed (con state OPEN)
Jun 10 06:47:09 10.3.1.1 kernel: [2960203.711368] libceph: osd2 10.3.1.2:6800 socket closed (con state OPEN)
Jun 10 06:48:18 10.3.1.1 kernel: [2960272.959980] libceph: osd16 10.3.1.16:6800 socket closed (con state OPEN)
Jun 10 06:54:25 10.3.1.1 kernel: [2960640.085364] libceph: osd7 10.3.1.7:6800 socket closed (con state OPEN)
Jun 10 06:54:31 10.3.1.1 kernel: [2960646.704091] libceph: osd16 10.3.1.16:6800 socket closed (con state OPEN)
Jun 10 06:57:32 10.3.1.1 kernel: [2960826.966644] libceph: osd16 10.3.1.16:6800 socket closed (con state OPEN)
Jun 10 06:59:28 10.3.1.1 kernel: [2960943.428968] libceph: osd2 10.3.1.2:6800 socket closed (con state OPEN)

That server has an RBD device mapped through the kernel client. It happens more often than an idle timeout would explain. For example, if I filter for a single OSD I get the following (a small counting sketch follows the excerpt):
Jun 10 05:06:40 10.3.1.1 kernel: [2954172.128634] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
Jun 10 05:07:01 10.3.1.1 kernel: [2954193.525812] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
Jun 10 05:07:41 10.3.1.1 kernel: [2954233.677509] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
Jun 10 05:57:51 10.3.1.1 kernel: [2957244.399411] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
Jun 10 06:01:19 10.3.1.1 kernel: [2957452.424746] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
Jun 10 06:25:50 10.3.1.1 kernel: [2958924.605430] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
Jun 10 06:39:46 10.3.1.1 kernel: [2959760.682579] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
Jun 10 07:26:03 10.3.1.1 kernel: [2962539.405359] libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)
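
For completeness, this is roughly how I pull those per-OSD counts out of syslog. It is only a minimal sketch: the log file path on the command line and the exact message format are assumptions based on the lines pasted above.

    #!/usr/bin/env python
    # Minimal sketch: count "libceph ... socket closed" events per OSD.
    # Assumes syslog-style lines like the excerpt above; the log file path
    # is passed as the first command-line argument (path is hypothetical).
    import re
    import sys
    from collections import Counter

    # Matches e.g. "libceph: osd3 10.3.1.3:6800 socket closed (con state OPEN)"
    PATTERN = re.compile(r'libceph: (osd\d+) (\S+) socket closed \(con state (\w+)\)')

    counts = Counter()
    with open(sys.argv[1]) as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                counts[match.groups()] += 1

    for (osd, addr, state), n in counts.most_common():
        print('%-6s %-16s state=%-5s count=%d' % (osd, addr, state, n))

Against the excerpts above it just confirms that the same few OSDs (osd2, osd3, osd7, osd16) show up repeatedly.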

On the OSDs, the corresponding log entries are:
2015-06-10 05:06:40.200895 7f71ff31a700  0 bad crc in data 1002098477 != exp 3174942904
2015-06-10 05:06:40.205817 7f71ff31a700  0 -- 10.3.1.3:6800/6257 >> 10.3.1.1:0/3647655452 pipe(0x1d6f5000 sd=28 :6800 s=0 pgs=0 cs=0 l=0 c=0x25521340).accept peer addr is really 10.3.1.1:0/3647655452 (socket is 10.3.1.1:53763/0)

2015-06-10 05:07:01.583206 7f71ff31a700  0 bad crc in data 3174714858 != exp 1350262707
2015-06-10 05:07:01.592705 7f71ff31a700  0 -- 10.3.1.3:6800/6257 >> 10.3.1.1:0/3647655452 pipe(0x1d6e5000 sd=28 :6800 s=0 pgs=0 cs=0 l=0 c=0x15b5f8c0).accept peer addr is really 10.3.1.1:0/3647655452 (socket is 10.3.1.1:53908/0)
2015-06-10 05:07:01.597651 7f720d577700  0 -- 10.3.1.3:6800/6257 submit_message osd_op_reply(46327515 rb.0.22937.6b8b4567.0000000077fd [write 262144~524288] v33657'1078938 uv1078938 _ondisk_ = 0) v6 remote, 10.3.1.1:0/3647655452, failed lossy con, dropping message 0x977b8c0
2015-06-10 05:07:01.602411 7f720ad72700  0 -- 10.3.1.3:6800/6257 submit_message osd_op_reply(46327516 rb.0.22937.6b8b4567.0000000077fd [write 786432~524288] v33657'1078939 uv1078939 _ondisk_ = 0) v6 remote, 10.3.1.1:0/3647655452, failed lossy con, dropping message 0x175958c0

2015-06-10 05:57:51.138059 7f71ff31a700  0 bad crc in data 4016280483 != exp 3125104237
2015-06-10 05:57:51.138653 7f71ff31a700  0 -- 10.3.1.3:6800/6257 >> 10.3.1.1:0/3647655452 pipe(0x17d2e000 sd=69 :6800 s=0 pgs=0 cs=0 l=0 c=0x157fb340).accept peer addr is really 10.3.1.1:0/3647655452 (socket is 10.3.1.1:50010/0)

And so on, with a matching OSD-side entry for every connection-closed message on the client.
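
As far as I can tell, the OSD recomputes a checksum over each incoming message and tears the connection down when it does not match what the client sent, which is why every "bad crc in data X != exp Y" on the OSD pairs up with a "socket closed (con state OPEN)" on the client. The following is only a toy sketch of that kind of check to make the pairing concrete; Ceph really uses CRC32C, and zlib.crc32 here is just a stand-in so the example stays self-contained.

    # Toy illustration only: how a receiver might validate a payload CRC and
    # give up on the connection when it does not match. Not Ceph code.
    import zlib

    def check_message(payload, expected_crc):
        """Return True if the payload's CRC matches the value sent by the peer."""
        actual = zlib.crc32(payload) & 0xffffffff
        if actual != expected_crc:
            # The OSD logs "bad crc in data <actual> != exp <expected>" and
            # drops the connection, which the kernel client then reports as
            # "socket closed (con state OPEN)".
            print('bad crc in data %u != exp %u' % (actual, expected_crc))
            return False
        return True

    # Example: a payload corrupted in flight fails the check.
    data = b'rbd write payload'
    good_crc = zlib.crc32(data) & 0xffffffff
    check_message(b'rbd write paXload', good_crc)   # prints a "bad crc" style line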

We are running ceph 0.94.1-1~bpo70+1 on a Debian Wheezy userspace with kernel 3.14.15.

My ceph.conf is simple:
[global]
        auth cluster required = none
        auth service required = none
        auth client required = none
        public network = 10.3.0.0/16
        cluster network = 10.3.0.0/16
        mon force standby active = true
        rbd cache = true
(plus host sections for the 16 OSDs)

Ceph is very robust and its error handling is excellent, so these errors are not affecting my users right now. But I'm worried that this could change in the future.

Best,
Daniel Colchete
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
