Blocked requests/ops?

xserrano+ceph@xxxxxxxxxx (Xavier Serrano) · Tue, 26 May 2015 07:05:36 +0200

Hello,

We have observed that our cluster often moves back and forth
between the HEALTH_OK and HEALTH_WARN states due to "blocked requests".
We have also observed "blocked ops". For instance:

# ceph status
    cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e5: 5 mons at {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0}
            election epoch 44, quorum 0,1,2,3,4 ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5
     osdmap e5091: 120 osds: 100 up, 100 in
      pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects
            13164 GB used, 168 TB / 181 TB avail
                2048 active+clean
  client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s

# ceph health detail
HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
1 ops are blocked > 67108.9 sec
1 ops are blocked > 67108.9 sec on osd.71
1 osds have slow requests
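
To pick out which OSDs are named in that output without reading it by eye, something like the following could be used. This is only a sketch: `blocked_osds` is a hypothetical helper name (not a Ceph command), and it assumes the `ceph health detail` line format shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): reads `ceph health detail` text
# on stdin and prints every OSD name that is reported with blocked ops.
blocked_osds() {
    # Match lines such as "1 ops are blocked > 67108.9 sec on osd.71",
    # print the last field (the OSD name), and drop duplicates.
    awk '/ops are blocked/ && / on osd\.[0-9]+$/ { print $NF }' | sort -u
}

# Intended use on a live cluster (assumption: run where the ceph CLI works):
#   ceph health detail | blocked_osds
```

On the host that runs a reported OSD, `ceph daemon osd.71 dump_ops_in_flight` (an admin-socket command) should then list the ops that are currently stuck there.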

My questions are:
(1) Is it normal to have "slow requests" in a cluster?
(2) Or is it a symptom that indicates that something is wrong?
    (for example, a disk is about to fail)
(3) How can we fix the "slow requests"?
(4) What is the meaning of "blocked ops", and how can they stay
    blocked for so long? (67000 seconds is more than 18 hours!)
(5) How can we fix the "blocked ops"?
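
For reference, the conversion behind the 18-hour figure in (4) can be checked with a one-liner. `secs_to_hours` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a duration in seconds to hours,
# rounded to one decimal place.
secs_to_hours() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s / 3600 }'
}

# The value reported by `ceph health detail` above:
secs_to_hours 67108.9   # prints 18.6
```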

Thank you very much for your help.

Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC