Re: Constant slow / blocked requests with otherwise healthy cluster

Andrey Korolyov <andrey@xxxxxxx> · Thu, 28 Nov 2013 01:00:46 +0400

Hey,

What replication factor are you using? With a factor of three, 1.5k
IOPS may be a little high for 36 disks, and your OSD IDs look a bit
suspicious: judging by the numbers below, there should not be any OSDs
numbered 60 or higher.
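
For reference, a rough sketch of the arithmetic behind that estimate, using
only the figures from the quoted "ceph status" output. The assumptions are
mine: every client op is treated as a write that hits all replicas (a worst
case), and the used/data ratio is only a hint at the replication factor,
since it also includes filesystem overhead and pools may differ in size.

```python
# Figures copied from the quoted cluster description and "ceph status" output.
nodes = 6
disks_per_node = 6
osds_in_map = 36            # "36 osds: 36 up, 36 in"
data_gb = 20153             # "20153 GB data"
used_gb = 60630             # "60630 GB used"
client_ops_per_sec = 545    # "545op/s"
highest_osd_id_seen = 65    # osd.65 in "ceph health detail"

# One OSD per data disk, so this should match the osdmap count.
expected_osds = nodes * disks_per_node

# Raw usage divided by logical data hints at the replication factor.
implied_replication = used_gb / data_gb

# Assumption: worst case, every client op is a write that lands on all
# replicas (reads would only touch the primary OSD).
backend_ops = client_ops_per_sec * round(implied_replication)
ops_per_disk = backend_ops / osds_in_map

print(f"expected OSDs: {expected_osds}")
print(f"implied replication factor: {implied_replication:.2f}")
print(f"worst-case backend ops/s: {backend_ops}")
print(f"worst-case ops/s per disk: {ops_per_disk:.1f}")
# With 36 OSDs numbered from 0, the highest id would normally be 35.
print(f"ids fit in 0..{osds_in_map - 1}: {highest_osd_id_seen < osds_in_map}")
```

The last line is the reason for the question about OSD ids: 36 OSDs with
ids up to 65 means the numbering is not contiguous, which is worth
explaining before digging further.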

On 11/28/2013 12:45 AM, Oliver Schulz wrote:
> Dear Ceph Experts,
> 
> our Ceph cluster suddenly went into a state of OSDs constantly having
> blocked or slow requests, rendering the cluster unusable. This happened
> during normal use; there were no updates or other changes.
> 
> All disks seem to be healthy (smartctl, iostat, etc.). A complete
> hardware reboot including system update on all nodes has not helped.
> The network equipment also shows no trouble.
> 
> We'd be glad for any advice on how to diagnose and solve this, as
> the cluster is basically at a standstill and we urgently need
> to get it back into operation.
> 
> Cluster structure: 6 Nodes, 6x 3TB disks plus 1x System/Journal SSD
> per node, one OSD per disk. We're running ceph version 0.67.4-1precise
> on Ubuntu 12.04.3 with kernel 3.8.0-33-generic (x86_64).
> 
> "ceph status" shows something like (it varies):
> 
>     cluster 899509fe-afe4-42f4-a555-bb044ca0f52d
>      health HEALTH_WARN 77 requests are blocked > 32 sec
>      monmap e1: 3 mons at {a=134.107.24.179:6789/0,b=134.107.24.181:6789/0,c=134.107.24.183:6789/0}, election epoch 312, quorum 0,1,2 a,b,c
>      osdmap e32600: 36 osds: 36 up, 36 in
>       pgmap v16404527: 14304 pgs: 14304 active+clean; 20153 GB data, 60630 GB used, 39923 GB / 100553 GB avail; 1506KB/s rd, 21246B/s wr, 545op/s
>      mdsmap e478: 1/1/1 up {0=c=up:active}, 1 up:standby-replay
> 
> "ceph health detail" shows something like (it varies):
> 
>     HEALTH_WARN 363 requests are blocked > 32 sec; 22 osds have slow requests
>     363 ops are blocked > 32.768 sec
>     1 ops are blocked > 32.768 sec on osd.0
>     8 ops are blocked > 32.768 sec on osd.3
>     37 ops are blocked > 32.768 sec on osd.12
>     [...]
>     11 ops are blocked > 32.768 sec on osd.62
>     45 ops are blocked > 32.768 sec on osd.65
>     22 osds have slow requests
> 
> The number and identity of the affected OSDs change constantly
> (sometimes health even goes to OK for a moment).
> 
> 
> Cheers and thanks for any ideas,
> 
> Oliver
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
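
Since the set of affected OSDs keeps changing, it may help to tally the
"ceph health detail" output over several runs and see whether the same ids
or the same hosts keep coming up. Below is a minimal sketch that parses the
per-OSD lines in the format quoted above. The function names are made up
for illustration, the sample contains only the lines visible in the quote
(the "[...]" part is left out, not guessed), and the line format may differ
in other Ceph versions.

```python
import re
from collections import Counter

# Sample copied from the quoted "ceph health detail" output.
SAMPLE = """\
363 ops are blocked > 32.768 sec
1 ops are blocked > 32.768 sec on osd.0
8 ops are blocked > 32.768 sec on osd.3
37 ops are blocked > 32.768 sec on osd.12
11 ops are blocked > 32.768 sec on osd.62
45 ops are blocked > 32.768 sec on osd.65
22 osds have slow requests
"""

# Matches only the per-OSD lines, e.g. "37 ops are blocked > 32.768 sec on osd.12".
LINE_RE = re.compile(r"^\s*(\d+) ops are blocked > [\d.]+ sec on osd\.(\d+)\s*$")


def blocked_ops_per_osd(text):
    """Return a Counter mapping OSD id -> number of blocked ops."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[int(match.group(2))] += int(match.group(1))
    return counts


def ids_outside_range(counts, osd_count):
    """Return OSD ids that would not exist if ids ran from 0 to osd_count - 1."""
    return sorted(osd_id for osd_id in counts if osd_id >= osd_count)


counts = blocked_ops_per_osd(SAMPLE)
for osd_id, ops in counts.most_common():
    print(f"osd.{osd_id}: {ops} blocked ops")
print("ids >= 36:", ids_outside_range(counts, 36))
```

Feeding it the output of repeated runs (for example one capture per minute)
and summing the counters would show whether the blocked ops cluster on
particular OSDs or spread evenly.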
