Slow responding OSDs are not OUTed and cause RBD client IO hangs

Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> · Sat, 22 Aug 2015 21:17:24 -0400

Hello, this is an issue we have been suffering from and researching
along with a good number of other Ceph users, as evidenced by the
recent posts.  In our specific case, these issues manifest themselves
in an RBD -> iSCSI LIO -> ESXi configuration, but the problem is more
general.

When there is an issue on OSD nodes (examples: network hangs/blips,
disk HBAs failing, driver issues, page cache/XFS issues), some OSDs
respond slowly or with significant delays.  Neither ceph osd perf nor
ceph osd tree, ceph -s or ceph -w shows this.  Instead, the RBD IO
hangs to the point where the client times out, crashes or displays
other unsavory behavior; operationally, this crashes production
processes.

Today in our lab we had a disk controller issue, which brought an OSD
node down.  Upon restart, the OSDs started up and rejoined the
cluster.  However, all IOs immediately started hanging for a long
time, and aborts from ESXi -> LIO did not succeed in canceling these
IOs.  The only warning I could see was:

root@lab2-mon1:/var/log/ceph# ceph health detail
HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
30 ops are blocked > 2097.15 sec
30 ops are blocked > 2097.15 sec on osd.4
1 osds have slow requests

However, ceph osd perf is not showing high latency on osd 4:

root@lab2-mon1:/var/log/ceph# ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
  0                     0                   13
  1                     0                    0
  2                     0                    0
  3                   172                  208
  4                     0                    0
  5                     0                    0
  6                     0                    1
  7                     0                    0
  8                   174                  819
  9                     6                   10
 10                     0                    1
 11                     0                    1
 12                     3                    5
 13                     0                    1
 14                     7                   23
 15                     0                    1
 16                     0                    0
 17                     5                    9
 18                     0                    1
 19                    10                   18
 20                     0                    0
 21                     0                    0
 22                     0                    1
 23                     5                   10

SMART state for osd 4 disk is OK.  The OSD is up and in:

root@lab2-mon1:/var/log/ceph# ceph osd tree
ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-8        0 root ssd
-7 14.71997 root platter
-3  7.12000     host croc3
22  0.89000         osd.22      up  1.00000          1.00000
15  0.89000         osd.15      up  1.00000          1.00000
16  0.89000         osd.16      up  1.00000          1.00000
13  0.89000         osd.13      up  1.00000          1.00000
18  0.89000         osd.18      up  1.00000          1.00000
 8  0.89000         osd.8       up  1.00000          1.00000
11  0.89000         osd.11      up  1.00000          1.00000
20  0.89000         osd.20      up  1.00000          1.00000
-4  0.47998     host croc2
10  0.06000         osd.10      up  1.00000          1.00000
12  0.06000         osd.12      up  1.00000          1.00000
14  0.06000         osd.14      up  1.00000          1.00000
17  0.06000         osd.17      up  1.00000          1.00000
19  0.06000         osd.19      up  1.00000          1.00000
21  0.06000         osd.21      up  1.00000          1.00000
 9  0.06000         osd.9       up  1.00000          1.00000
23  0.06000         osd.23      up  1.00000          1.00000
-2  7.12000     host croc1
 7  0.89000         osd.7       up  1.00000          1.00000
 2  0.89000         osd.2       up  1.00000          1.00000
 6  0.89000         osd.6       up  1.00000          1.00000
 1  0.89000         osd.1       up  1.00000          1.00000
 5  0.89000         osd.5       up  1.00000          1.00000
 0  0.89000         osd.0       up  1.00000          1.00000
 4  0.89000         osd.4       up  1.00000          1.00000
 3  0.89000         osd.3       up  1.00000          1.00000

How can we proactively detect this condition?  Is there anything I can
run that will output all slow OSDs?
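
As a stopgap, the only signal that named osd.4 was the ceph health
detail output shown above, so one option is to scrape it.  Below is a
rough sketch, not a tested tool: it assumes the per-OSD lines keep the
"N ops are blocked > T sec on osd.X" format seen in this post (which
may differ between Ceph releases), and the function names slow_osds
and print_slow_osds are made up for illustration.

```python
import re
import subprocess
from collections import defaultdict

# Matches per-OSD lines such as "30 ops are blocked > 2097.15 sec on osd.4".
# Assumption: this is the line format shown in the health detail output above.
BLOCKED_RE = re.compile(r"(\d+) ops are blocked > ([\d.]+) sec on osd\.(\d+)")


def slow_osds(health_detail_text):
    """Return {osd_id: (total_blocked_ops, longest_block_seconds)}."""
    ops = defaultdict(int)
    worst = defaultdict(float)
    for line in health_detail_text.splitlines():
        m = BLOCKED_RE.search(line)
        if not m:
            continue  # summary lines without "on osd.N" are skipped
        count, secs, osd = int(m.group(1)), float(m.group(2)), int(m.group(3))
        ops[osd] += count
        worst[osd] = max(worst[osd], secs)
    return {osd: (ops[osd], worst[osd]) for osd in ops}


def print_slow_osds():
    """Run 'ceph health detail' and print one line per OSD with blocked ops."""
    out = subprocess.check_output(["ceph", "health", "detail"]).decode()
    for osd, (count, secs) in sorted(slow_osds(out).items()):
        print("osd.%d: %d ops blocked, longest > %.2f sec" % (osd, count, secs))
```

Calling print_slow_osds() periodically on a monitor node (from cron,
for example) would at least name the affected OSDs, but it is only
reactive: it reports requests that are already blocked.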

Regards,
Alex
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com