Re: HELP ! Cluster unusable with lots of "hit suicide timeout"

On Wed, Oct 19, 2016 at 3:22 PM, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
> Hello,
>
>>> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is composed of 12 nodes; each node has 10 OSDs with journals on disk.
>>>
>>> We have one rbd partition and a RadosGW with 2 data pools, one replicated, one EC (8+2).
>>>
>>> Attached are a few details on our cluster.
>>>
>>> Currently, our cluster is not usable at all due to too much OSD instability. OSD daemons die randomly with "hit suicide timeout". Yesterday, each
>>> of our 120 OSDs died at least 12 times (max 74 times), around 40 times on average.
>>>
>>> here logs from ceph mon and from one OSD :
>>>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
>>
>> Do you have an older log showing the start of the incident? The
>> cluster was already down when this log started.
>
> Here are the logs from Saturday; OSD 134 is the first that had errors:
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.134.log.4.bz2
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.4.bz2
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.4.bz2


Do you have osd.86's log? I think it was the first to fail:

2016-10-15 14:42:32.109025 mon.0 10.90.37.3:6789/0 5240160 : cluster
[INF] osd.86 10.90.37.15:6823/11625 failed (2 reporters from different
host after 20.000215 >= grace 20.000000)

Then these OSDs failed a couple of seconds later:

2016-10-15 14:42:34.900989 mon.0 10.90.37.3:6789/0 5240180 : cluster
[INF] osd.27 10.90.37.5:6802/5426 failed (2 reporters from different
host after 20.000417 >= grace 20.000000)
2016-10-15 14:42:34.902105 mon.0 10.90.37.3:6789/0 5240183 : cluster
[INF] osd.95 10.90.37.12:6822/12403 failed (2 reporters from different
host after 20.001862 >= grace 20.000000)
2016-10-15 14:42:34.902653 mon.0 10.90.37.3:6789/0 5240185 : cluster
[INF] osd.131 10.90.37.25:6820/195317 failed (2 reporters from
different host after 20.002387 >= grace 20.000000)
2016-10-15 14:42:34.903205 mon.0 10.90.37.3:6789/0 5240187 : cluster
[INF] osd.136 10.90.37.23:6803/5148 failed (2 reporters from different
host after 20.002898 >= grace 20.000000)
2016-10-15 14:42:35.576139 mon.0 10.90.37.3:6789/0 5240191 : cluster
[INF] osd.24 10.90.37.3:6800/4587 failed (2 reporters from different
host after 21.384669 >= grace 20.094412)
2016-10-15 14:42:35.580217 mon.0 10.90.37.3:6789/0 5240193 : cluster
[INF] osd.37 10.90.37.11:6838/179566 failed (3 reporters from
different host after 20.680190 >= grace 20.243928)
2016-10-15 14:42:35.581550 mon.0 10.90.37.3:6789/0 5240195 : cluster
[INF] osd.46 10.90.37.9:6800/4811 failed (2 reporters from different
host after 21.389655 >= grace 20.000000)
2016-10-15 14:42:35.582286 mon.0 10.90.37.3:6789/0 5240197 : cluster
[INF] osd.64 10.90.37.21:6810/7658 failed (2 reporters from different
host after 21.390167 >= grace 20.409388)
2016-10-15 14:42:35.582823 mon.0 10.90.37.3:6789/0 5240199 : cluster
[INF] osd.107 10.90.37.19:6820/10260 failed (2 reporters from
different host after 21.390516 >= grace 20.074818)


Just a hunch, but do OSDs 86, 27, 95, etc. all share the same PG?
Use 'ceph pg dump' to check.
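To check that hunch programmatically, here is a rough sketch (hypothetical helper, not a standard tool) that looks for PGs whose acting set contains two or more of the failed OSDs. It assumes you saved `ceph pg dump --format json` to a file and that the dump exposes the usual `pg_stats` list with `pgid` and `acting` fields:

```python
# Hypothetical sketch: find PGs whose acting set contains two or more
# of the OSDs that failed together, from `ceph pg dump --format json`.
FAILED = {24, 27, 37, 46, 64, 86, 95, 107, 131, 136}

def suspect_pgs(pg_stats, failed=FAILED):
    """Return {pgid: sorted overlap} for PGs acting on >= 2 failed OSDs."""
    hits = {}
    for pg in pg_stats:
        overlap = failed.intersection(pg.get("acting", []))
        if len(overlap) >= 2:
            hits[pg["pgid"]] = sorted(overlap)
    return hits
```

You would feed it something like `json.load(open("pgdump.json"))["pg_stats"]`; any PG it reports is a candidate for a shared object that is hanging several OSDs at once.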

>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
>>
>> In this log the thread which is hanging is doing deep-scrub:
>>
>> 2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
>> [INF] : 39.54 deep-scrub starts
>> 2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
>> 2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150
>>
>> So you can disable scrubbing completely with
>>
>>   ceph osd set noscrub
>>   ceph osd set nodeep-scrub
>>
>> in case you are hitting some corner case with the scrubbing code.
>
> Now the cluster seems to be healthy, but as soon as I re-enable scrubbing and rebalancing, OSDs start to flap and the cluster switches to HEALTH_ERR.
>

Looks like recovery/backfill are enabled and you have otherwise all
clean PGs. Don't be afraid to leave scrubbing disabled until you
understand exactly what is going wrong.

Do you see any SCSI / IO errors on the disks failing to scrub?
Though, it seems unlikely that so many disks are all failing at the
same time. More likely there's at least one object that's giving the
scrubber problems and hanging the related OSDs.
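One way to narrow that down (a hypothetical sketch, not a standard tool): pair each "deep-scrub starts" line with a later suicide-timeout line for the same thread id in an OSD log, using the line formats quoted above. The PGs it reports are the ones whose scrub was in flight when the thread hit the timeout:

```python
import re

# Match the two line shapes quoted earlier in this thread:
#   ... 7f12da4af700  0 log_channel(cluster) log [INF] : 39.54 deep-scrub starts
#   ... 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150
SCRUB = re.compile(
    r"^\S+ \S+ (?P<tid>\w+) +\d+ log_channel\(cluster\) log "
    r"\[INF\] : (?P<pg>\S+) deep-scrub starts")
SUICIDE = re.compile(r"thread 0x(?P<tid>\w+)' had suicide timed out")

def hung_scrubs(lines):
    """Return the PGs whose deep-scrub thread later hit the suicide timeout."""
    last_pg = {}  # thread id -> last PG it started deep-scrubbing
    hung = []
    for line in lines:
        m = SCRUB.search(line)
        if m:
            last_pg[m.group("tid")] = m.group("pg")
            continue
        m = SUICIDE.search(line)
        if m and m.group("tid") in last_pg:
            hung.append(last_pg[m.group("tid")])
    return hung
```

Run it over an affected OSD's log (e.g. the cephprod-osd.10 log above); if the same PG keeps showing up across OSDs, that points at a problem object in that PG.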


>     cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>       health HEALTH_WARN
>              noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>       monmap e1: 3 mons at
> {iccluster002.iccluster.epfl.ch=10.90.37.3:6789/0,iccluster010.iccluster.epfl.ch=10.90.37.11:6789/0,iccluster018.iccluster.epfl.ch=10.90.37.19:6789/0}
>              election epoch 64, quorum 0,1,2 iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>        fsmap e131: 1/1/1 up {0=iccluster022.iccluster.epfl.ch=up:active}, 2 up:standby
>       osdmap e72932: 144 osds: 144 up, 120 in
>              flags noout,noscrub,nodeep-scrub,sortbitwise
>        pgmap v4834810: 9408 pgs, 28 pools, 153 TB data, 75849 kobjects
>              449 TB used, 203 TB / 653 TB avail
>                  9408 active+clean
>
>
>>> We have stopped all client I/O to see if the cluster would stabilize, without success. To avoid endless rebalancing with OSD flapping, we had to
>>> "set noout" the cluster. For now we have no idea what's going on.
>>>
>>> Can anyone help us understand what's happening?
>>
>> Is your network OK?
>
> We have one 10G NIC for the private network and one 10G NIC for the public network. The network is far from loaded right now and there are no
> errors. We don't use jumbo frames.
>

OK, this seems not to be network-related.

>> It will be useful to see the start of the incident to better
>> understand what caused this situation.
>>
>> Also, maybe useful for you... you can increase the suicide timeout, e.g.:
>>
>>    osd op thread suicide timeout: <something larger than 150>
>>
>> If the cluster is just *slow* somehow, then increasing that might
>> help. If there is something systematically broken, increasing would
>> just postpone the inevitable.
>
> OK, I'm going to study this option with my colleagues.

Probably not needed, since you found scrubbing to cause the problem.
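For completeness, if you do decide to raise it, the change would go in the `[osd]` section of ceph.conf (the value 300 below is purely illustrative, just something larger than the 150 s default seen in your logs), followed by an OSD restart:

```ini
[osd]
# Illustrative value only: double the 150 s default suicide timeout.
osd op thread suicide timeout = 300
```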

-- Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


