On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> That looks a lot like what I was seeing initially. The OSDs getting marked
> out was relatively rare and it took a bit before I saw it.

Our problem, by contrast, occurs most of the time, and it does not appear to be confined to any specific Ceph cluster node or OSD:

$ sudo fgrep 'waiting for subops' ceph.log | sed -e 's/.* v4 //' | sort | uniq -c | sort -n
      1 currently waiting for subops from 0
      1 currently waiting for subops from 10
      1 currently waiting for subops from 11
      1 currently waiting for subops from 12
      1 currently waiting for subops from 3
      1 currently waiting for subops from 7
      2 currently waiting for subops from 13
      2 currently waiting for subops from 16
      2 currently waiting for subops from 4
      3 currently waiting for subops from 15
      4 currently waiting for subops from 6
      4 currently waiting for subops from 8
      7 currently waiting for subops from 2

Node f16: 0, 2, and 3 (3 out of 4)
Node f17: 4, 6, 7, 8, 10, 11, 12, 13, and 15 (9 out of 12)
Node f18: 16 (1 out of 12)

So f18 seems like the odd man out, in that it has *fewer* problems than the other two.

There are a grand total of 2 RX errors across all the interfaces on all three machines. (Each one has dual 10G interfaces bonded together as active/failover.)

The OSD log for the worst offender above (osd.2) says:

2015-07-17 08:52:05.441607 7f562ea0c700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.119568 secs
2015-07-17 08:52:05.441622 7f562ea0c700 0 log [WRN] : slow request 30.119568 seconds old, received at 2015-07-17 08:51:35.321991: osd_sub_op(client.32913524.0:3149584 2.249 2792c249/rbd_data.15322ae8944a.000000000011b487/head//2 [] v 10705'944603 snapset=0=[]:[] snapc=0=[]) v11 currently started
2015-07-17 08:52:43.229770 7f560833f700 0 -- 192.168.2.216:6813/16029552 >> 192.168.2.218:6810/7028653 pipe(0x25265180 sd=25 :6813 s=2 pgs=23894 cs=41 l=0 c=0x22be4c60).fault with nothing to send, going to standby

There are a bunch of those "fault with nothing to send, going to standby" messages.

> The messages were like "So-and-so incorrectly marked us
> out" IIRC.

Nothing like that here. Nor, with "ceph -w" running constantly, have I seen any reference to anything being marked out at any point, even when the problems are severe.

Thanks!
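
P.S. For anyone wanting to cross-check the OSD-to-node mapping above (I read it off by hand) or dig further into osd.2, something like this sketch should work. Assumptions: a reasonably current ceph CLI with admin-socket access on the node hosting osd.2; the exact daemon-command form may vary by release.

$ # Show the CRUSH tree, i.e. which host each OSD lives under
$ ceph osd tree

$ # On the node hosting osd.2: dump its slowest recent operations,
$ # which should show where the 30+ second subop requests stall
$ ceph daemon osd.2 dump_historic_ops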