When a single high-I/O event occurs (in this case, a cp of a 10G file on a filesystem mounted on an rbd), the two OSDs that reside on the same system where the rbd is mounted get marked down when it appears they shouldn't be. Additionally, other cluster services start timing out just after the OSDs are marked down (it appears to be mysql, which is rbd-backed, becoming unresponsive). Before the OSDs are marked down things are running slowly, but nothing appears to actually fail until they are marked down.

Here is an interesting snippet from the logs:

Feb 28 21:12:11 172.17.0.13 ceph-mon: 2013-02-28 21:12:11.081003 7f377687d700 1 mon.0@0(leader).osd e14 we have enough reports/reporters to mark osd.2 down
Feb 28 21:12:11 172.17.0.13 [ 663.241832] libceph: osd2 down
Feb 28 21:12:11 172.17.0.14 [ 655.577185] libceph: osd2 down
Feb 28 21:12:11 172.17.0.13 [ 663.242064] libceph: osd5 down
Feb 28 21:12:11 172.17.0.13 kernel: [ 663.241832] libceph: osd2 down
Feb 28 21:12:11 172.17.0.13 kernel: [ 663.242064] libceph: osd5 down
Feb 28 21:12:11 172.17.0.14 [ 655.577434] libceph: osd5 down
Feb 28 21:12:11 172.17.0.14 kernel: [ 655.577185] libceph: osd2 down
Feb 28 21:12:11 172.17.0.14 kernel: [ 655.577434] libceph: osd5 down
Feb 28 21:12:12 172.17.0.13 ceph-osd: 2013-02-28 21:12:12.423178 osd.5 172.17.0.13:6803/2015 126 : [WRN] map e16 wrongly marked me down
Feb 28 21:12:12 172.17.0.13 ceph-osd: 2013-02-28 21:12:12.423177 7f4c10a0e700 0 log [WRN] : map e16 wrongly marked me down
Feb 28 21:12:17 172.17.0.13 ceph-osd: 2013-02-28 21:12:17.208466 7f01aa894700 0 log [WRN] : map e16 wrongly marked me down
Feb 28 21:12:17 172.17.0.13 ceph-osd: 2013-02-28 21:12:17.208468 osd.2 172.17.0.13:6800/1924 187 : [WRN] map e16 wrongly marked me down

The full log is available here: http://download.pistoncloud.com/p/ceph-2.log.xz

Note: the compressed log is only about 8MB, but uncompressed it's about 160MB. I've also enabled libceph and rbd kernel debugging.

Here's a brief run-down of the configuration:

ceph 0.56.3
kernel 3.5.7 with 116 patches I got from Alex Elder
4 nodes with 2 OSDs per node (each OSD is a 120GB SSD, using xfs)
1Gbit networking between nodes
rbds mapped on all nodes running OSDs, with xfs filesystems on top of them

I'm not sure if it would help, but are there any updated ceph patches for the 3.5.x kernel?
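In case it's relevant to the "we have enough reports/reporters" line above: as I understand it, the monitor decides to mark an OSD down based on the heartbeat and failure-report settings in ceph.conf. A rough sketch of the knobs involved, with what I believe are the defaults (option names and values come from my reading of the docs, not copied from my actual config):

    [osd]
        # how long (in seconds) an OSD can go without answering
        # heartbeats before its peers report it as failed
        osd heartbeat grace = 20

    [mon]
        # how many distinct OSDs have to report a peer as failed
        # before the monitor will mark it down
        mon osd min down reporters = 1

        # how many failure reports have to accumulate in total
        mon osd min down reports = 3

Raising the grace period or the reporter thresholds might paper over the flapping during heavy I/O, but it wouldn't explain why the OSDs stop answering heartbeats in the first place.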
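For completeness, the rbd-on-the-OSD-nodes layering mentioned above is just the standard kernel-client setup, roughly like this (the pool, image, and mountpoint names here are made up for illustration, not our real ones):

    # create an image and map it with the kernel rbd client
    rbd create --size 10240 rbd/test-img
    rbd map rbd/test-img              # shows up as /dev/rbd0

    # put xfs on it and mount it on the same node that runs OSDs
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/test

The cp that triggers the problem is just writing a 10G file into one of those xfs mounts.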