On Mon, Jul 9, 2012 at 10:04 AM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
> On 09/07/2012 18:54, Yann Dupont wrote:
>
>> Ok. I've compiled the kernel this afternoon, and tested it without much
>> success:
>>
>> Jul 9 18:17:23 label5.u14.univ-nantes.prive kernel: [ 284.116236] libceph: osd0 172.20.14.130:6801 socket closed
>> Jul 9 18:17:43 label5.u14.univ-nantes.prive kernel: [ 304.101545] libceph: osd6 172.20.14.137:6800 socket closed
>> Jul 9 18:17:53 label5.u14.univ-nantes.prive kernel: [ 314.095155] libceph: osd3 172.20.14.134:6800 socket closed
>> Jul 9 18:18:38 label5.u14.univ-nantes.prive kernel: [ 359.075473] libceph: osd5 172.20.14.136:6800 socket closed
>> Jul 9 18:19:48 label5.u14.univ-nantes.prive kernel: [ 429.107334] libceph: osd6 172.20.14.137:6800 socket closed
>
> Just an interesting thing I noticed in the logs:
>
> osd-0.log
> 2012-07-09 18:17:23.763925 7ff9fc19e700 0 bad crc in data 3071411075 != exp 2231697357
> 2012-07-09 18:17:23.777607 7ff9fc19e700 0 -- 172.20.14.130:6801/5842 >> 172.20.14.132:0/1974511416 pipe(0x2236c80 sd=38 pgs=0 cs=0 l=0).accept peer addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:57972/0)
>
> osd-3.log
> 2012-07-09 18:17:53.770111 7fe35461c700 0 bad crc in data 826922774 != exp 2498450653
> 2012-07-09 18:17:53.770972 7fe35461c700 0 -- 172.20.14.134:6800/4495 >> 172.20.14.132:0/1974511416 pipe(0xa44ec80 sd=56 pgs=0 cs=0 l=0).accept peer addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:40726/0)
>
> osd-5.log
> 2012-07-09 18:18:38.766417 7ff4a66cb700 0 bad crc in data 3949121728 != exp 2496058560
> 2012-07-09 18:18:38.773386 7ff4a66cb700 0 -- 172.20.14.136:6800/4876 >> 172.20.14.132:0/1974511416 pipe(0x20eeb780 sd=56 pgs=0 cs=0 l=0).accept peer addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:57072/0)
>
> osd-6.log
> 2012-07-09 18:17:43.765740 7fdf86b9d700 0 bad crc in data 2899452345 != exp 2656886014
> 2012-07-09 18:17:43.772599 7fdf86b9d700 0 -- 172.20.14.137:6800/5260 >> 172.20.14.132:0/1974511416 pipe(0x1ec64780 sd=31 pgs=0 cs=0 l=0).accept peer addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:48615/0)
>
> 2012-07-09 18:17:43.773170 7fdf8c718700 0 osd.6 347 pg[2.60( v 347'36181 (337'35180,347'36181] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod 347'36180 active+clean] watch: ctx->obc=0x102db340 cookie=1 oi.version=36169 ctx->at_version=347'36182
> 2012-07-09 18:17:43.773209 7fdf8c718700 0 osd.6 347 pg[2.60( v 347'36181 (337'35180,347'36181] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod 347'36180 active+clean] watch: oi.user_version=1559
> 2012-07-09 18:19:48.837952 7fdf86b9d700 0 bad crc in data 1231964953 != exp 2305533436
> 2012-07-09 18:19:48.838850 7fdf86b9d700 0 -- 172.20.14.137:6800/5260 >> 172.20.14.132:0/1974511416 pipe(0x1ec64c80 sd=31 pgs=0 cs=0 l=0).accept peer addr is really 172.20.14.132:0/1974511416 (socket is 172.20.14.132:48618/0)
> 2012-07-09 18:19:48.839493 7fdf8c718700 0 osd.6 347 pg[2.60( v 347'36192 (337'35191,347'36192] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod 347'36191 active+clean] watch: ctx->obc=0x102db340 cookie=1 oi.version=36169 ctx->at_version=347'36193
> 2012-07-09 18:19:48.839530 7fdf8c718700 0 osd.6 347 pg[2.60( v 347'36192 (337'35191,347'36192] n=4144 ec=1 les/c 6/6 5/5/5) [6,7] r=0 lpr=5 mlcod 347'36191 active+clean] watch: oi.user_version=1559
>
> Each time, at the exact same time, there is a bad CRC (these are the only
> ones for this day, so it seems related).

Yes; a bad CRC should cause the socket to close. That's intended
behavior (although you might want to look into why the bad CRCs are
happening, since it's not something we've seen locally at all). Not
handling that socket close is definitely a bug in the kernel that needs
to get tracked down, though.
-Greg
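
The observation in the quoted message, that every libceph "socket closed" on the client lines up with a "bad crc in data" entry on the matching OSD, can be checked mechanically. Below is a minimal sketch, assuming the client and OSD clocks are in sync and that all entries fall on the same day. The log lines are copied from the thread; the function names and the one-second tolerance are illustrative choices, not part of Ceph.

```python
import re

# Client-side kernel log lines, as quoted in the thread.
KERNEL_LINES = [
    "Jul 9 18:17:23 label5.u14.univ-nantes.prive kernel: [ 284.116236] libceph: osd0 172.20.14.130:6801 socket closed",
    "Jul 9 18:17:43 label5.u14.univ-nantes.prive kernel: [ 304.101545] libceph: osd6 172.20.14.137:6800 socket closed",
    "Jul 9 18:17:53 label5.u14.univ-nantes.prive kernel: [ 314.095155] libceph: osd3 172.20.14.134:6800 socket closed",
    "Jul 9 18:18:38 label5.u14.univ-nantes.prive kernel: [ 359.075473] libceph: osd5 172.20.14.136:6800 socket closed",
    "Jul 9 18:19:48 label5.u14.univ-nantes.prive kernel: [ 429.107334] libceph: osd6 172.20.14.137:6800 socket closed",
]

# "bad crc" lines from each OSD log, keyed by OSD id, as quoted in the thread.
OSD_LINES = {
    0: ["2012-07-09 18:17:23.763925 7ff9fc19e700 0 bad crc in data 3071411075 != exp 2231697357"],
    3: ["2012-07-09 18:17:53.770111 7fe35461c700 0 bad crc in data 826922774 != exp 2498450653"],
    5: ["2012-07-09 18:18:38.766417 7ff4a66cb700 0 bad crc in data 3949121728 != exp 2496058560"],
    6: [
        "2012-07-09 18:17:43.765740 7fdf86b9d700 0 bad crc in data 2899452345 != exp 2656886014",
        "2012-07-09 18:19:48.837952 7fdf86b9d700 0 bad crc in data 1231964953 != exp 2305533436",
    ],
}

KERNEL_RE = re.compile(r"(\d\d):(\d\d):(\d\d) .*libceph: osd(\d+) \S+ socket closed")
OSD_RE = re.compile(r"^\S+ (\d\d):(\d\d):(\d\d)\.\d+ \S+\s+\d+ bad crc in data \d+ != exp \d+")


def seconds_of_day(hh, mm, ss):
    """Convert HH, MM, SS strings to seconds since midnight."""
    return int(hh) * 3600 + int(mm) * 60 + int(ss)


def socket_closes(lines):
    """Yield (osd_id, second_of_day) for each libceph 'socket closed' line."""
    for line in lines:
        m = KERNEL_RE.search(line)
        if m:
            yield int(m.group(4)), seconds_of_day(*m.group(1, 2, 3))


def bad_crcs(lines_by_osd):
    """Yield (osd_id, second_of_day) for each 'bad crc in data' line."""
    for osd_id, lines in lines_by_osd.items():
        for line in lines:
            m = OSD_RE.search(line)
            if m:
                yield osd_id, seconds_of_day(*m.group(1, 2, 3))


def correlate(kernel_lines, lines_by_osd, tolerance=1):
    """Return (osd_id, second_of_day, matched) for each socket close.

    'matched' is True when the same OSD logged a bad CRC within
    'tolerance' seconds of the close seen by the client.
    """
    crcs = list(bad_crcs(lines_by_osd))
    results = []
    for osd_id, t in socket_closes(kernel_lines):
        matched = any(o == osd_id and abs(t - c) <= tolerance for o, c in crcs)
        results.append((osd_id, t, matched))
    return results


if __name__ == "__main__":
    for osd_id, t, matched in correlate(KERNEL_LINES, OSD_LINES):
        status = "bad crc on OSD" if matched else "no matching bad crc"
        print(f"osd{osd_id} close at {t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}: {status}")
```

On the lines quoted in this thread, each of the five socket closes pairs with a bad CRC on the same OSD, which is consistent with the close being the OSD's reaction to the CRC mismatch.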