Hi,

Sometimes we see the same issue on our 10.2.9 cluster (24 nodes with 60 OSDs each). I think there is some race condition or something like that which results in this state. The blocked requests start exactly at the time the PG begins to scrub.

You can try the following; the OSD will automatically recover and the blocked requests will disappear:

ceph osd down 31

In my opinion this is a bug, but I have not investigated it so far. Maybe some developer can say something about this issue.

Regards,
Manuel

On Tue, 22 Aug 2017 16:20:14 +0300, Ramazan Terzi <ramazanterzi@xxxxxxxxx> wrote:

> Hello,
>
> I have a Ceph cluster with the specifications below:
> 3 x monitor node
> 6 x storage node (6 disks per storage node, 6 TB SATA disks, all disks
> have SSD journals). Separate public and private networks. All NICs
> are 10 Gbit/s.
> osd pool default size = 3
> osd pool default min size = 2
>
> Ceph version is Jewel 10.2.6.
>
> My cluster is active and a lot of virtual machines are running on it
> (Linux and Windows VMs, database clusters, web servers, etc.).
>
> During normal use, the cluster slowly went into a state of blocked
> requests, and the blocked requests keep incrementing periodically. All OSDs seem
> healthy; benchmarks, iowait checks, and network tests all succeed.
>
> Yesterday, 08:00:
> $ ceph health detail
> HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
> 1 ops are blocked > 134218 sec on osd.31
> 1 ops are blocked > 134218 sec on osd.3
> 1 ops are blocked > 8388.61 sec on osd.29
> 3 osds have slow requests
>
> Today, 16:05:
> $ ceph health detail
> HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
> 1 ops are blocked > 134218 sec on osd.31
> 1 ops are blocked > 134218 sec on osd.3
> 16 ops are blocked > 134218 sec on osd.29
> 11 ops are blocked > 67108.9 sec on osd.29
> 2 ops are blocked > 16777.2 sec on osd.29
> 1 ops are blocked > 8388.61 sec on osd.29
> 3 osds have slow requests
>
> $ ceph pg dump | grep scrub
> dumped all in format plain
> pg_stat  objects  mip  degr  misp  unf  bytes        log   disklog  state
> 20.1e    25183    0    0     0     0    98332537930  3066  3066     active+clean+scrubbing
>
> state_stamp                 v              reported       up         up_primary  acting     acting_primary
> 2017-08-21 04:55:13.354379  6930'23908781  6930:20905696  [29,31,3]  29          [29,31,3]  29
>
> last_scrub     scrub_stamp                 last_deep_scrub  deep_scrub_stamp
> 6712'22950171  2017-08-20 04:46:59.208792  6712'22950171    2017-08-20 04:46:59.208792
>
> The active scrub does not finish (it has been running for about 24 hours), and I have
> not restarted any OSD in the meantime. I'm thinking of setting the noscrub, nodeep-scrub,
> norebalance, nobackfill, and norecover flags and then restarting OSDs 3, 29, and 31.
> Will this solve my problem? Or does anyone have a suggestion about this problem?
>
> Thanks,
> Ramazan
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Manuel Lausch
Systemadministrator, Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lausch@xxxxxxxx | Web: www.1und1.de
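
For anyone wanting to try the workaround Manuel describes, here is a minimal sketch of the command sequence, assuming osd.31 is one of the OSDs reporting blocked ops (as in the health output above); the noout flag is an extra precaution not mentioned in the thread, and the OSD number would need adjusting for another cluster:

$ ceph osd set noout      # optional: keep the briefly down-marked OSD from being marked out
$ ceph osd down 31        # mark osd.31 down in the OSD map; the running daemon notices and re-peers on its own
$ ceph -w                 # watch the OSD come back up and peering complete
$ ceph health detail      # the blocked/slow request warnings should clear after re-peering
$ ceph osd unset noout    # remove the precautionary flag again

This only follows the behaviour Manuel reports (the OSD recovers automatically after being marked down); it does not address the suspected scrub bug itself.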