On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> That looks a lot like what I was seeing initially. The OSDs getting marked
> out was relatively rare and it took a bit before I saw it.

Our problem, by contrast, occurs most of the time, and it does not appear to be confined to any specific Ceph cluster node or OSD:

$ sudo fgrep 'waiting for subops' ceph.log | sed -e 's/.* v4 //' | sort | uniq -c | sort -n
      1 currently waiting for subops from 0
      1 currently waiting for subops from 10
      1 currently waiting for subops from 11
      1 currently waiting for subops from 12
      1 currently waiting for subops from 3
      1 currently waiting for subops from 7
      2 currently waiting for subops from 13
      2 currently waiting for subops from 16
      2 currently waiting for subops from 4
      3 currently waiting for subops from 15
      4 currently waiting for subops from 6
      4 currently waiting for subops from 8
      7 currently waiting for subops from 2

Node f16: 0, 2, and 3 (3 out of 4)
Node f17: 4, 6, 7, 8, 10, 11, 12, 13, and 15 (9 out of 12)
Node f18: 16 (1 out of 12)

So f18 seems like the odd man out, in that it has *fewer* problems than the other two.

There are a grand total of 2 RX errors across all the interfaces on all three machines. (Each one has dual 10G interfaces bonded together as active/failover.)

The OSD log for the worst offender above (osd.2) says:

2015-07-17 08:52:05.441607 7f562ea0c700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.119568 secs
2015-07-17 08:52:05.441622 7f562ea0c700 0 log [WRN] : slow request 30.119568 seconds old, received at 2015-07-17 08:51:35.321991: osd_sub_op(client.32913524.0:3149584 2.249 2792c249/rbd_data.15322ae8944a.000000000011b487/head//2 [] v 10705'944603 snapset=0=[]:[] snapc=0=[]) v11 currently started
2015-07-17 08:52:43.229770 7f560833f700 0 -- 192.168.2.216:6813/16029552 >> 192.168.2.218:6810/7028653 pipe(0x25265180 sd=25 :6813 s=2 pgs=23894 cs=41 l=0 c=0x22be4c60).fault with nothing to send, going to standby

There are a bunch of those "fault with nothing to send, going to standby" messages.

> The messages were like "So-and-so incorrectly marked us
> out" IIRC.

Nothing like that here. Nor, with "ceph -w" running constantly, have I seen any reference to anything being marked out at any point, even when the problems are severe.

Thanks!
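
P.S. For anyone wanting to cross-check the OSD-to-node mapping above (I read it off by hand) or dig further into osd.2, something like this sketch should work. Assumptions: a reasonably current ceph CLI with admin-socket access on the node hosting osd.2; the exact daemon-command form may vary by release.

$ # Show the CRUSH tree, i.e. which host each OSD lives under
$ ceph osd tree

$ # On the node hosting osd.2: dump its slowest recent operations,
$ # which should show where the 30+ second subop requests stall
$ ceph daemon osd.2 dump_historic_ops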