Hi Yoann,

On Wed, Oct 19, 2016 at 9:44 AM, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
> Dear List,
>
> We have a cluster running Jewel 10.2.2 on Ubuntu 16.04. The cluster is composed of 12 nodes; each node has 10 OSDs with journals on disk.
>
> We have one rbd partition and a radosgw with 2 data pools, one replicated and one EC (8+2).
>
> Attached are a few details on our cluster.
>
> Currently, our cluster is not usable at all due to too much OSD instability. OSD daemons die randomly with "hit suicide timeout". Yesterday, all
> 120 OSDs died at least 12 times each (max 74 times), with an average of around 40.
>
> Here are logs from the ceph mon and from one OSD:
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)

Do you have an older log showing the start of the incident? The cluster was already down when this log started.

> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

In this log the thread which is hanging is doing a deep-scrub:

2016-10-18 22:16:23.985462 7f12da4af700 0 log_channel(cluster) log [INF] : 39.54 deep-scrub starts
2016-10-18 22:16:39.008961 7f12e4cc4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
2016-10-18 22:18:54.175912 7f12e34c1700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150

So you can disable scrubbing completely with

  ceph osd set noscrub
  ceph osd set nodeep-scrub

in case you are hitting some corner case in the scrubbing code.

> We have stopped all client i/o to see if the cluster gets stable, without success. To avoid endless rebalancing with OSDs flapping, we had to
> "set noout" the cluster. For now we have no idea what's going on.
>
> Can anyone help us understand what's happening?

Is your network OK? It will be useful to see the start of the incident to better understand what caused this situation.

Also, maybe useful for you...
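For reference, a sketch of applying and verifying the scrub-disable suggestion above (this assumes cluster-admin access from a node with the admin keyring; adjust to your deployment):

```shell
# Disable all scrubbing cluster-wide; both flags are reversible.
ceph osd set noscrub
ceph osd set nodeep-scrub

# Confirm the flags took effect (look for "noscrub,nodeep-scrub").
ceph osd dump | grep flags

# Once the OSDs are stable again, re-enable scrubbing:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

Note these flags only stop new scrubs from being scheduled; a scrub already in progress will still run to completion (or hit its timeout).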
you can increase the suicide timeout, e.g.:

  osd op thread suicide timeout: <something larger than 150>

If the cluster is just *slow* somehow, then increasing that might help. If there is something systematically broken, increasing it would just postpone the inevitable.

-- Dan

> thanks for your help
>
> --
> Yoann Moulin
> EPFL IC-IT
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com