Having some issues with blocked ops on a small cluster. Running 0.94.5 with cache tiering: 3 cache nodes with 8 SSDs each and 3 spinning nodes with 12 spinning disks and journals. All the pools are 3x replicas.

Started experiencing problems with OSDs in the cold tier consuming the entirety of the system memory (128GB) and then dying. Set the noup flag, restarted those OSDs and eventually got them back into the cluster. Since that point, I've been unable to get a healthy cluster.

I've traced down some blocked ops that were waiting on other OSDs. Checked the health of the hardware and restarted the OSDs, but the problems seem to be moving around. I just found log messages like this:

2016-05-24 15:30:37.594651 7f90fcffb700 0 log_channel(cluster) log [WRN] : 36 slow requests, 2 included below; oldest blocked for > 31169.985102 secs

2016-05-24 15:30:37.594664 7f90fcffb700 0 log_channel(cluster) log [WRN] : slow request 30720.467611 seconds old, received at 2016-05-24 06:58:37.126070: osd_op(osd.58.3025:1750712 rbd_data.f60ea86c157272.000000000000182e@snapdir [list-snaps] 9.4c0312f1 ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e6109) currently waiting for missing object

2016-05-24 15:30:37.594789 7f90fcffb700 0 log_channel(cluster) log [WRN] : slow request 30720.466730 seconds old, received at 2016-05-24 06:58:37.126951: osd_op(osd.58.3025:1750713 rbd_data.f60ea86c157272.000000000000182e [copy-get max 8388608] 9.4c0312f1 ack+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e6109) currently waiting for missing object

Any help narrowing this down would be appreciated.

-H