Hi!

This sounds like http://tracker.ceph.com/issues/20763 (or indeed
http://tracker.ceph.com/issues/20866). It is still present in 12.2.2 (I
just tried it). My workaround is to exclude radosgw from logrotate
(remove "radosgw" from /etc/logrotate.d/ceph) so it no longer gets
SIGHUPed, to rotate the logs manually from time to time, and to
completely restart the radosgw processes one after the other on my
radosgw cluster.

Regards,
Martin

On 08.12.17, 18:58, "ceph-users on behalf of Graham Allan"
<ceph-users-bounces@xxxxxxxxxxxxxx on behalf of gta@xxxxxxx> wrote:

    I noticed this morning that all four of our rados gateways (Luminous
    12.2.2) hung at logrotate time overnight. The last message logged was:

    > 2017-12-08 03:21:01.897363 7fac46176700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125

    One of the three nodes recorded more detail:

    > 2017-12-08 06:51:04.452108 7f80fbfdf700 1 rgw realm reloader: Pausing frontends for realm update...
    > 2017-12-08 06:51:04.452126 7f80fbfdf700 1 rgw realm reloader: Frontends paused
    > 2017-12-08 06:51:04.452891 7f8202436700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125

    I remember seeing this happen on our test cluster a while back with
    Kraken. I can't find the tracker issue I originally found related to
    this, but it also sounds like it could be a regression of bug #20339
    or #20686?

    I recorded some strace output from one of the radosgw instances
    before restarting, if it's useful for opening an issue.

    --
    Graham Allan
    Minnesota Supercomputing Institute - gta@xxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
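For anyone wanting to try the workaround: the idea is that logrotate's
postrotate script sends SIGHUP to the Ceph daemons after rotation, and
dropping radosgw from that list keeps it from being signalled. The
fragment below is only a sketch of what an edited /etc/logrotate.d/ceph
might look like; the daemon list, paths, and options in the file shipped
by your Ceph packages will differ by version, so treat everything here
as an assumption to check against your own copy of the file.

```conf
# /etc/logrotate.d/ceph (sketch, not the shipped file)
/var/log/ceph/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        # "radosgw" removed from the killall list below, so the
        # gateways are no longer SIGHUPed at rotation time:
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd || true
    endscript
    missingok
    notifempty
}
```

With that in place you then rotate the radosgw logs by hand now and
then and restart each gateway in turn (the exact systemd instance name,
e.g. `systemctl restart ceph-radosgw@rgw.$(hostname -s)`, depends on
how your deployment named the units).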