Re: HELP ! Cluster unusable with lots of "hit suicide timeout"

Yoann Moulin <yoann.moulin@xxxxxxx> · Wed, 19 Oct 2016 15:22:41 +0200

Hello,

>> We have a cluster running Jewel 10.2.2 on Ubuntu 16.04. The cluster is composed of 12 nodes; each node has 10 OSDs with journals on disk.
>>
>> We have one RBD partition and a radosGW with 2 data pools: one replicated, one EC (8+2).
>>
>> Attached are a few details about our cluster.
>>
>> Currently, our cluster is not usable at all because the OSDs are too unstable. OSD daemons die randomly with "hit suicide timeout". Yesterday,
>> each of the 120 OSDs died at least 12 times (max 74 times), with an average of around 40 times.
>>
>> Here are logs from the ceph mon and from one OSD:
>>
>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
> 
> Do you have an older log showing the start of the incident? The
> cluster was already down when this log started.

Here are the logs from Saturday; OSD 134 is the first one that had an error:

http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.134.log.4.bz2
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.4.bz2
http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.4.bz2

>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
> 
> In this log the thread which is hanging is doing deep-scrub:
> 
> 2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
> [INF] : 39.54 deep-scrub starts
> 2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
> 2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150
> 
> So you can disable scrubbing completely with
> 
>   ceph osd set noscrub
>   ceph osd set nodeep-scrub
> 
> in case you are hitting some corner case with the scrubbing code.
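
To check whether other suicide timeouts in an OSD log also follow a deep-scrub on the same thread, something like the Python sketch below could help. It is only a rough illustration: it assumes each entry is a single line in the actual log file (the excerpt above is wrapped by the mail client) and that the entries look exactly like the three quoted ones. The function name and the regular expressions are my own, not anything shipped with Ceph.

```python
import re
from datetime import datetime

# Patterns modelled on the three log entries quoted above (an assumption:
# other Ceph versions may format these lines differently).
TS = r"(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)"
SCRUB_RE = re.compile(
    r"^" + TS + r" (?P<tid>[0-9a-f]+)\s+\d+ .*: (?P<pg>\S+) deep-scrub starts"
)
TIMEOUT_RE = re.compile(
    r"^" + TS + r" .*heartbeat_map is_healthy "
    r"'(?P<pool>\S+) thread 0x(?P<tid>[0-9a-f]+)' "
    r"had (?P<kind>suicide timed out|timed out) after (?P<limit>\d+)"
)


def parse_ts(text):
    """Parse a timestamp such as '2016-10-18 22:16:23.985462'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def suicides_during_deep_scrub(lines):
    """Return (pg, thread id, seconds since scrub start) for every suicide
    timeout reported for a thread that had previously logged a deep-scrub start."""
    last_scrub = {}  # thread id -> (start time, pg)
    hits = []
    for line in lines:
        m = SCRUB_RE.match(line)
        if m:
            last_scrub[m["tid"]] = (parse_ts(m["ts"]), m["pg"])
            continue
        m = TIMEOUT_RE.match(line)
        if m and m["kind"] == "suicide timed out" and m["tid"] in last_scrub:
            started, pg = last_scrub[m["tid"]]
            elapsed = (parse_ts(m["ts"]) - started).total_seconds()
            hits.append((pg, m["tid"], round(elapsed, 1)))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < uncompressed-osd-log
    for pg, tid, elapsed in suicides_during_deep_scrub(sys.stdin):
        print(f"pg {pg}: thread {tid} hit suicide timeout {elapsed}s after deep-scrub start")
```

If most suicide timeouts in the OSD logs show up here with a PG attached, that would support the idea that deep-scrub is what hangs the op threads; if few do, the cause is probably elsewhere.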

Now the cluster seems to be healthy, but as soon as I re-enable scrubbing and rebalancing, OSDs start to flap and the cluster switches to HEALTH_ERR.

    cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
      health HEALTH_WARN
             noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
      monmap e1: 3 mons at {iccluster002.iccluster.epfl.ch=10.90.37.3:6789/0,iccluster010.iccluster.epfl.ch=10.90.37.11:6789/0,iccluster018.iccluster.epfl.ch=10.90.37.19:6789/0}
             election epoch 64, quorum 0,1,2 iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
       fsmap e131: 1/1/1 up {0=iccluster022.iccluster.epfl.ch=up:active}, 2 up:standby
      osdmap e72932: 144 osds: 144 up, 120 in
             flags noout,noscrub,nodeep-scrub,sortbitwise
       pgmap v4834810: 9408 pgs, 28 pools, 153 TB data, 75849 kobjects
             449 TB used, 203 TB / 653 TB avail
                 9408 active+clean

>> We stopped all client I/O to see whether the cluster would stabilize, without success. To avoid endless rebalancing caused by OSD flapping, we had to
>> "set noout" on the cluster. For now we have no idea what's going on.
>>
>> Can anyone help us understand what's happening?
> 
> Is your network OK?

We have one 10G NIC for the private network and one 10G NIC for the public network. The network is far from loaded right now and shows no
errors. We don't use jumbo frames.

> It will be useful to see the start of the incident to better
> understand what caused this situation.
>
> Also, maybe useful for you... you can increase the suicide timeout, e.g.:
> 
>    osd op thread suicide timeout: <something larger than 150>
> 
> If the cluster is just *slow* somehow, then increasing that might
> help. If there is something systematically broken, increasing would
> just postpone the inevitable.

OK, I'm going to look into this option with my colleagues.

Thanks

-- 
Yoann Moulin
EPFL IC-IT