blocked ops

Heath Albritton <halbritt@xxxxxxxx> · Tue, 24 May 2016 08:52:07 -0700

I'm having some issues with blocked ops on a small cluster.  It is
running 0.94.5 with cache tiering: 3 cache nodes with 8 SSDs each and
3 spinning nodes with 12 spinning disks and journals.  All the pools
are 3x replicas.

I started experiencing problems with OSDs in the cold tier consuming
the entirety of the system memory (128GB) and then dying.  I set the
noup flag, restarted those OSDs, and eventually got them back into the
cluster.

Since that point, I've been unable to get a healthy cluster.  I've
traced down some blocked ops that were waiting on other OSDs.  I
checked the health of the hardware and restarted the OSDs, but the
problems seem to be moving around.  I just found log messages like this:

2016-05-24 15:30:37.594651 7f90fcffb700  0 log_channel(cluster) log [WRN] : 36 slow requests, 2 included below; oldest blocked for > 31169.985102 secs
2016-05-24 15:30:37.594664 7f90fcffb700  0 log_channel(cluster) log [WRN] : slow request 30720.467611 seconds old, received at 2016-05-24 06:58:37.126070: osd_op(osd.58.3025:1750712 rbd_data.f60ea86c157272.000000000000182e@snapdir [list-snaps] 9.4c0312f1 ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e6109) currently waiting for missing object
2016-05-24 15:30:37.594789 7f90fcffb700  0 log_channel(cluster) log [WRN] : slow request 30720.466730 seconds old, received at 2016-05-24 06:58:37.126951: osd_op(osd.58.3025:1750713 rbd_data.f60ea86c157272.000000000000182e [copy-get max 8388608] 9.4c0312f1 ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e6109) currently waiting for missing object
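
To see whether the blocked ops cluster around particular objects, the
slow-request entries in the log excerpt can be tallied with a small script.
What follows is only a rough sketch in Python: the regular expression is
fitted to the two entries quoted in this message and may need adjusting
for other op types, and the names parse_slow_requests, summarize, and
the field names are made up for illustration and are not part of Ceph.

```python
import re
from collections import Counter

# Pattern fitted to the "slow request" entries quoted in this message.
# Assumption: other op types may print differently and need a looser pattern.
SLOW_RE = re.compile(
    r"slow request (?P<age>\d+\.\d+) seconds old, "
    r"received at (?P<received>\d{4}-\d{2}-\d{2} [\d:.]+): "
    r"osd_op\((?P<client>\S+) (?P<object>\S+) \[(?P<op>[^\]]*)\] "
    r"(?P<pg>\S+) (?P<flags>\S+) (?P<epoch>e\d+)\) "
    r"currently (?P<state>.+?)(?= \d{4}-\d{2}-\d{2} |$)"
)


def parse_slow_requests(log_text):
    """Return one dict per slow-request entry found in log_text."""
    # Collapse all whitespace so entries wrapped by a mail client still match.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in SLOW_RE.finditer(flat)]


def summarize(entries):
    """Count entries per (object name without @snapdir suffix, state)."""
    return Counter((e["object"].split("@")[0], e["state"]) for e in entries)


# The excerpt from this message, wrapped the way the mail client wrapped it.
SAMPLE = """
2016-05-24 15:30:37.594651 7f90fcffb700  0 log_channel(cluster) log
[WRN] : 36 slow requests, 2 included below; oldest blocked for >
31169.985102 secs
2016-05-24 15:30:37.594664 7f90fcffb700  0 log_channel(cluster) log
[WRN] : slow request 30720.467611 seconds old, received at 2016-05-24
06:58:37.126070: osd_op(osd.58.3025:1750712
rbd_data.f60ea86c157272.000000000000182e@snapdir [list-snaps]
9.4c0312f1 ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e6109) currently waiting for missing object
2016-05-24 15:30:37.594789 7f90fcffb700  0 log_channel(cluster) log
[WRN] : slow request 30720.466730 seconds old, received at 2016-05-24
06:58:37.126951: osd_op(osd.58.3025:1750713
rbd_data.f60ea86c157272.000000000000182e [copy-get max 8388608]
9.4c0312f1 ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e6109) currently waiting for missing object
"""

if __name__ == "__main__":
    entries = parse_slow_requests(SAMPLE)
    for (obj, state), count in summarize(entries).most_common():
        print(count, obj, state)
```

Run against a full OSD log, this should show quickly whether the
"waiting for missing object" ops all point at a handful of objects, as
both entries here do.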

Any help narrowing this down would be appreciated.

-H
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com