Hello,

Firstly, find my "Unexplainable slow request" thread in the ML archives
and read all of it.

On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote:

> Hello,
>
> We have observed that our cluster is often moving back and forth
> from HEALTH_OK to HEALTH_WARN states due to "blocked requests".
> We have also observed "blocked ops". For instance:
>
As always, SW versions and a detailed HW description (down to the model
of HDDs used) will be helpful and educational.

> # ceph status
>     cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
>      health HEALTH_WARN
>             1 requests are blocked > 32 sec
>      monmap e5: 5 mons at {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0}
>             election epoch 44, quorum 0,1,2,3,4 ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5
>      osdmap e5091: 120 osds: 100 up, 100 in
>       pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects
>             13164 GB used, 168 TB / 181 TB avail
>                 2048 active+clean
>   client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s
>
> # ceph health detail
> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
> 1 ops are blocked > 67108.9 sec
> 1 ops are blocked > 67108.9 sec on osd.71
> 1 osds have slow requests
>
You will want to have a very close look at osd.71 (logs, internal
counters, cranking up debugging), but you might find it just as
mysterious as my case in the thread mentioned above.

>
> My questions are:
> (1) Is it normal to have "slow requests" in a cluster?
Not really, though the Ceph developers clearly think those merit just a
WARNING level, whereas I would consider them a clear sign of brokenness,
as VMs or other clients with those requests pending are likely to be
unusable at that point.

> (2) Or is it a symptom that indicates that something is wrong?
>     (for example, a disk is about to fail)
That.
Of course, your cluster could be just at the edge of its performance,
and nothing but improving that (most likely by adding more nodes/OSDs)
would fix it.

> (3) How can we fix the "slow requests"?
Depends on the cause, of course.
AFTER you have exhausted all means and gotten all relevant
log/performance data from osd.71, restarting the OSD might be all that's
needed.

> (4) What's the meaning of "blocked ops", and how can they be
>     blocked so long? (67000 seconds is more than 18 hours!)
Precisely, this shouldn't happen.

> (5) How can we fix the "blocked ops"?
>
AFTER you have exhausted all means and gotten all relevant
log/performance data from osd.71, restarting the OSD might be all that's
needed.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/
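
As a starting point for the "very close look at osd.71" suggested above, here is a minimal Python sketch that is not part of the original thread. It parses `ceph health detail` output in the format quoted above, reports how long each OSD's ops have been blocked, and prints admin-socket commands one might run next. The helper names (`blocked_ops`, `followup_commands`) are hypothetical. The listed `ceph daemon ...` commands are assumed to be available in the installed Ceph version and have to be run on the host that carries the OSD.

```python
import re

# Matches per-OSD lines such as "1 ops are blocked > 67108.9 sec on osd.71".
# The cluster-wide summary line (without "on osd.N") is skipped on purpose.
BLOCKED_RE = re.compile(
    r"(?P<count>\d+) ops are blocked > (?P<secs>[\d.]+) sec on osd\.(?P<osd>\d+)"
)


def blocked_ops(health_detail):
    """Return (osd_id, op_count, blocked_seconds) tuples, one per per-OSD line."""
    return [
        (int(m.group("osd")), int(m.group("count")), float(m.group("secs")))
        for m in BLOCKED_RE.finditer(health_detail)
    ]


def followup_commands(osd_id):
    """Admin-socket commands to collect data BEFORE restarting the OSD.

    Assumption: these subcommands exist in the installed Ceph release.
    """
    return [
        f"ceph daemon osd.{osd_id} dump_ops_in_flight",
        f"ceph daemon osd.{osd_id} dump_historic_ops",
        f"ceph daemon osd.{osd_id} perf dump",
    ]


# Output copied from the "ceph health detail" quote in this thread.
SAMPLE = """\
HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
1 ops are blocked > 67108.9 sec
1 ops are blocked > 67108.9 sec on osd.71
1 osds have slow requests
"""

for osd_id, count, secs in blocked_ops(SAMPLE):
    print(f"osd.{osd_id}: {count} op(s) blocked for {secs / 3600:.1f} hours")
    for cmd in followup_commands(osd_id):
        print("  " + cmd)
```

For the output quoted in this thread, the sketch reports osd.71 with one op blocked for about 18.6 hours, which matches the "more than 18 hours" in question (4). Save the output of the listed commands before any restart, since the in-flight op state is lost once the OSD goes down.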