Hi Yoann,

On Wed, Oct 19, 2016 at 9:44 AM, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
> Dear List,
>
> We have a cluster running Jewel 10.2.2 on Ubuntu 16.04. The cluster is composed of 12 nodes; each node has 10 OSDs with journals on disk.
>
> We have one rbd partition and a radosgw with 2 data pools, one replicated and one EC (8+2).
>
> Attached are a few details on our cluster.
>
> Currently, our cluster is not usable at all due to too much OSD instability. OSD daemons die randomly with "hit suicide timeout". Yesterday, all
> 120 OSDs died at least 12 times each (max 74 times), with an average of around 40.
>
> Here are logs from the ceph mon and from one OSD:
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)

Do you have an older log showing the start of the incident? The cluster was already down when this log started.

> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

In this log the thread which is hanging is doing a deep-scrub:

2016-10-18 22:16:23.985462 7f12da4af700 0 log_channel(cluster) log [INF] : 39.54 deep-scrub starts
2016-10-18 22:16:39.008961 7f12e4cc4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
2016-10-18 22:18:54.175912 7f12e34c1700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150

So you can disable scrubbing completely with

  ceph osd set noscrub
  ceph osd set nodeep-scrub

in case you are hitting some corner case in the scrubbing code.

> We have stopped all client i/o to see if the cluster gets stable, without success. To avoid endless rebalancing with OSDs flapping, we had to
> "set noout" the cluster. For now we have no idea what's going on.
>
> Can anyone help us understand what's happening?

Is your network OK? It will be useful to see the start of the incident to better understand what caused this situation.

Also, maybe useful for you...
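For reference, a sketch of applying and verifying the scrub-disable suggestion above (this assumes cluster-admin access from a node with the admin keyring; adjust to your deployment):

```shell
# Disable all scrubbing cluster-wide; both flags are reversible.
ceph osd set noscrub
ceph osd set nodeep-scrub

# Confirm the flags took effect (look for "noscrub,nodeep-scrub").
ceph osd dump | grep flags

# Once the OSDs are stable again, re-enable scrubbing:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

Note these flags only stop new scrubs from being scheduled; a scrub already in progress will still run to completion (or hit its timeout).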
you can increase the suicide timeout, e.g.:

  osd op thread suicide timeout: <something larger than 150>

If the cluster is just *slow* somehow, then increasing that might help. If there is something systematically broken, increasing it would just postpone the inevitable.

-- Dan

> thanks for your help
>
> --
> Yoann Moulin
> EPFL IC-IT
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com