Hello,

No specific ideas, but this somewhat sounds familiar.

One thing first: you already stopped client traffic, but to make sure your
cluster really becomes quiescent, stop all scrubs as well (commands at the
end of this mail). That's always a good idea in any recovery or overload
situation.

Have you verified CPU load (are those OSD processes busy?), memory status,
etc.? How busy are the actual disks?

Sudden deaths like this are often the result of network changes, such as a
switch rebooting and losing its jumbo frame configuration or whatnot.

Christian

On Wed, 19 Oct 2016 09:44:01 +0200 Yoann Moulin wrote:

> Dear List,
>
> We have a cluster on Jewel 10.2.2 under Ubuntu 16.04. The cluster is
> composed of 12 nodes; each node has 10 OSDs with journals on disk.
>
> We have one RBD partition and a RadosGW with two data pools: one
> replicated, one EC (8+2).
>
> A few details on our cluster are attached.
>
> Currently, our cluster is not usable at all due to too much OSD
> instability. OSD daemons die randomly with "hit suicide timeout".
> Yesterday, each of our 120 OSDs died at least 12 times (max 74 times),
> around 40 times on average.
>
> Here are logs from the ceph mon and from one OSD:
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
>
> We have stopped all client I/O to see if the cluster gets stable, without
> success. To avoid endless rebalancing with OSDs flapping, we had to set
> "noout" on the cluster. For now we have no idea what's going on.
>
> Can anyone help us understand what's happening?
>
> Thanks for your help.

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
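
P.S. The concrete knobs, for completeness. To stop all scrubbing (alongside
the "noout" you already set):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph osd set noout       # you did this one already
  # undo with "ceph osd unset ..." once the cluster is stable again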
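
To get a quick idea of how busy the OSD processes and the actual disks are,
run something like this on an OSD node (iostat comes from the sysstat
package):

  top -b -n 1 | grep ceph-osd   # per-process CPU usage of the OSD daemons
  free -m                       # memory status
  iostat -x 5                   # per-disk %util and await, every 5 seconds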
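
And to verify jumbo frames still work end to end, ping another node with
the don't-fragment bit set and a payload just under your MTU ("eth0" and
"osd-node2" are placeholders; 8972 assumes a 9000 byte MTU, adjust to
yours):

  ip link show eth0                # check what MTU the interface thinks it has
  ping -M do -s 8972 -c 3 osd-node2   # 8972 = 9000 - 20 (IP) - 8 (ICMP)

If that ping fails while a default-size one works, something on the path
lost its jumbo frame configuration.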