Dear Ceph Experts,

our Ceph cluster suddenly went into a state where OSDs constantly have blocked or slow requests, rendering the cluster unusable. This happened during normal use; there were no updates or other changes. All disks seem to be healthy (smartctl, iostat, etc.). A complete hardware reboot, including a system update on all nodes, has not helped. The network equipment also shows no trouble. We'd be glad for any advice on how to diagnose and solve this, as the cluster is basically at a standstill and we urgently need to get it back into operation.

Cluster structure: 6 nodes, 6x 3TB disks plus 1x system/journal SSD per node, one OSD per disk. We're running ceph version 0.67.4-1precise on Ubuntu 12.04.3 with kernel 3.8.0-33-generic (x86_64).

"ceph status" shows something like (it varies):

  cluster 899509fe-afe4-42f4-a555-bb044ca0f52d
   health HEALTH_WARN 77 requests are blocked > 32 sec
   monmap e1: 3 mons at {a=134.107.24.179:6789/0,b=134.107.24.181:6789/0,c=134.107.24.183:6789/0}, election epoch 312, quorum 0,1,2 a,b,c
   osdmap e32600: 36 osds: 36 up, 36 in
   pgmap v16404527: 14304 pgs: 14304 active+clean; 20153 GB data, 60630 GB used, 39923 GB / 100553 GB avail; 1506KB/s rd, 21246B/s wr, 545op/s
   mdsmap e478: 1/1/1 up {0=c=up:active}, 1 up:standby-replay

"ceph health detail" shows something like (it varies):

  HEALTH_WARN 363 requests are blocked > 32 sec; 22 osds have slow requests
  363 ops are blocked > 32.768 sec
  1 ops are blocked > 32.768 sec on osd.0
  8 ops are blocked > 32.768 sec on osd.3
  37 ops are blocked > 32.768 sec on osd.12
  [...]
  11 ops are blocked > 32.768 sec on osd.62
  45 ops are blocked > 32.768 sec on osd.65
  22 osds have slow requests

The number and identity of affected OSDs constantly changes (sometimes health even goes to OK for a moment).

Cheers and thanks for any ideas,
Oliver
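
P.S. If it would help to narrow this down, we can pull more detail from one of the currently affected OSDs and post the output here. Below is just a sketch of what we had in mind, using osd.12 as an example and assuming our 0.67.4 OSDs expose dump_ops_in_flight and perf dump on the admin socket at the default path:

  # ops currently blocked on this OSD, with their state and age
  ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight

  # internal counters (journal/filestore latencies, queue lengths, ...)
  ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok perf dump

  # temporarily raise logging on that OSD to see where a blocked op is sitting
  ceph tell osd.12 injectargs '--debug_osd 20 --debug_ms 1'

Please let us know if other output would be more useful.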