I noticed this morning that all four of our rados gateways (luminous
12.2.2) hung at logrotate time overnight. The last message logged was:
2017-12-08 03:21:01.897363 7fac46176700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
One of the 3 nodes recorded more detail:
2017-12-08 06:51:04.452108 7f80fbfdf700 1 rgw realm reloader: Pausing frontends for realm update...
2017-12-08 06:51:04.452126 7f80fbfdf700 1 rgw realm reloader: Frontends paused
2017-12-08 06:51:04.452891 7f8202436700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
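For what it's worth, ret=-125 looks like a negated errno value, and on
Linux errno 125 is ECANCELED ("Operation canceled"). Here is a minimal
Python sketch for pulling the return code out of a log line like the one
above and checking what it maps to on the local machine. The helper names
are my own and hypothetical, not anything from the Ceph code base, and
errno numbering is platform-specific, so run it on the gateway host:

```python
import errno
import os
import re

# Hypothetical helper names; nothing here comes from the Ceph code base.
# Matches the integer that follows "ret=" in an rgw log line.
RET_PATTERN = re.compile(r"\bret=(-?\d+)")


def parse_ret(line):
    """Return the integer after 'ret=' in a log line, or None if absent."""
    match = RET_PATTERN.search(line)
    return int(match.group(1)) if match else None


def describe_ret(ret):
    """Map a negated errno value to (symbolic name, message) on this platform."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


line = ("2017-12-08 06:51:04.452891 7f8202436700 0 ERROR: failed to clone "
        "shard, completion_mgr.get_next() returned ret=-125")
ret = parse_ret(line)
print(ret, describe_ret(ret))
```

This only decodes the number; it says nothing about why the operation was
canceled.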
I remember seeing this happen on our test cluster a while back with
Kraken. I can't find the tracker issue I originally found for it, but
it also sounds like it could be a regression of bug #20339 or #20686?
I recorded some strace output from one of the radosgw instances before
restarting, in case it's useful for opening an issue.
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com