Blocked requests/ops?

chibi@xxxxxxx (Christian Balzer) · Tue, 26 May 2015 14:19:22 +0900

Hello,

Firstly, find my "Unexplainable slow request" thread in the ML archives
and read all of it.

On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote:

> Hello,
> 
> We have observed that our cluster is often moving back and forth
> from HEALTH_OK to HEALTH_WARN states due to "blocked requests".
> We have also observed "blocked ops". For instance:
> 
As always, SW versions and a detailed HW description (down to the model of
the HDDs used) will be helpful and educational.

> # ceph status
>     cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
>      health HEALTH_WARN
>             1 requests are blocked > 32 sec
>      monmap e5: 5 mons at {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0}
>             election epoch 44, quorum 0,1,2,3,4 ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5
>      osdmap e5091: 120 osds: 100 up, 100 in
>       pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects
>             13164 GB used, 168 TB / 181 TB avail
>                 2048 active+clean
>   client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s
> 
> # ceph health detail
> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
> 1 ops are blocked > 67108.9 sec
> 1 ops are blocked > 67108.9 sec on osd.71
> 1 osds have slow requests
> 
You will want to have a very close look at osd.71 (logs, internal
counters, cranking up debugging), but might find it just as mysterious as
my case in the thread mentioned above.
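
For the "logs, internal counters, cranking up debugging" part, here is a
rough sketch of the commands one might start with, wrapped in a small Python
helper so the OSD id is easy to change. The helper name
osd_diagnostic_commands and the debug levels 20/1 are my own choices for
illustration. The "ceph daemon" commands talk to the admin socket, so they
have to be run on the host where osd.71 lives, and availability may vary
with your Ceph version.

```python
import shlex


def osd_diagnostic_commands(osd_id, debug_osd=20, debug_ms=1):
    """Return argv lists for a first look at one OSD (hypothetical helper).

    Nothing is executed here; the caller decides where and how to run them.
    """
    osd = f"osd.{osd_id}"
    return [
        # Ops currently stuck inside the OSD, with their age and last event.
        ["ceph", "daemon", osd, "dump_ops_in_flight"],
        # Recently completed slow ops, for comparison.
        ["ceph", "daemon", osd, "dump_historic_ops"],
        # Internal performance counters.
        ["ceph", "daemon", osd, "perf", "dump"],
        # Raise debug logging at runtime (assumed levels; lower them again afterwards).
        ["ceph", "tell", osd, "injectargs",
         f"--debug-osd {debug_osd} --debug-ms {debug_ms}"],
    ]


if __name__ == "__main__":
    for argv in osd_diagnostic_commands(71):
        print(shlex.join(argv))
```

Save the output of each command before touching the OSD; once it is
restarted, the in-flight op information is gone.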

> 
> My questions are:
> (1) Is it normal to have "slow requests" in a cluster?
Not really, though the Ceph developers clearly think those merit just a
WARNING level, whereas I would consider them a clear sign of brokenness,
as VMs or other clients with such requests pending are likely to be
unusable at that point.

> (2) Or is it a symptom that indicates that something is wrong?
>     (for example, a disk is about to fail)
That. Of course, your cluster could simply be at the edge of its performance,
in which case nothing but improving that (most likely by adding more
nodes/OSDs) would fix it.

> (3) How can we fix the "slow requests"?
Depends on the cause, of course.
AFTER you have exhausted all means and gotten all relevant log/performance
data from osd.71, restarting the OSD might be all that's needed.

> (4) What's the meaning of "blocked ops", and how can they be
>     blocked so long? (67000 seconds is more than 18 hours!)
Precisely, this shouldn't happen.
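
To keep an eye on this without eyeballing the output, something like the
following could pull the per-OSD lines out of "ceph health detail" and
convert the seconds to hours. This is only a sketch: the function name
blocked_ops_by_osd is made up, the regex is written against the exact
wording quoted above, and that wording may differ in other Ceph versions.

```python
import re

# Matches lines such as "1 ops are blocked > 67108.9 sec on osd.71".
# The pattern is an assumption based on the output quoted in this thread.
BLOCKED_RE = re.compile(
    r"^(?P<count>\d+) ops are blocked > (?P<secs>[\d.]+) sec on (?P<osd>osd\.\d+)$"
)


def blocked_ops_by_osd(health_detail):
    """Map each OSD name to a list of (op_count, seconds_blocked) tuples."""
    result = {}
    for line in health_detail.splitlines():
        m = BLOCKED_RE.match(line.strip())
        if m:
            entry = (int(m["count"]), float(m["secs"]))
            result.setdefault(m["osd"], []).append(entry)
    return result


# The "ceph health detail" output from the original post.
SAMPLE = """\
HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
1 ops are blocked > 67108.9 sec
1 ops are blocked > 67108.9 sec on osd.71
1 osds have slow requests
"""

if __name__ == "__main__":
    for osd, entries in blocked_ops_by_osd(SAMPLE).items():
        for count, secs in entries:
            hours = secs / 3600
            print(f"{osd}: {count} op(s) blocked for more than {hours:.1f} hours")
```

As a side note, 67108.9 is 2^26 milliseconds (67108.864 s) rounded to one
decimal, so the figure may well be a reporting threshold rather than the
exact age of the op. That is an inference from the number itself, not
something I have verified in the code.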

> (5) How can we fix the "blocked ops"?
> 
AFTER you have exhausted all means and gotten all relevant log/performance
data from osd.71, restarting the OSD might be all that's needed.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/