Luminous rgw hangs after SIGHUP

Graham Allan <gta@xxxxxxx> · Fri, 8 Dec 2017 11:57:52 -0600

I noticed this morning that all four of our RADOS gateways (Luminous 
12.2.2) hung at logrotate time overnight. The last message logged was:

2017-12-08 03:21:01.897363 7fac46176700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
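
For anyone decoding the line above: Ceph conventionally reports failures as negated errno values, so ret=-125 is errno 125. On Linux, that is ECANCELED (operation canceled). The sketch below pulls the code out of the log line and names it. It assumes a Linux host, because errno numbers are platform-specific. The helper name decode_ret is hypothetical and is not part of Ceph.

```python
import errno
import os
import re

# Log line copied verbatim from the radosgw log above.
LINE = ("2017-12-08 03:21:01.897363 7fac46176700  0 ERROR: failed to clone "
        "shard, completion_mgr.get_next() returned ret=-125")


def decode_ret(line):
    """Return (errno name, strerror text) for a 'ret=-N' field, or None.

    Hypothetical helper. Assumes a Linux host: errno numbering differs
    between platforms, and 125 is ECANCELED on Linux.
    """
    match = re.search(r"ret=(-?\d+)", line)
    if match is None:
        return None
    code = abs(int(match.group(1)))  # Ceph returns negated errno values
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


print(decode_ret(LINE))
```

On a Linux host, the first element printed should be the name ECANCELED.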

One of the 3 nodes recorded more detail:
2017-12-08 06:51:04.452108 7f80fbfdf700  1 rgw realm reloader: Pausing frontends for realm update...
2017-12-08 06:51:04.452126 7f80fbfdf700  1 rgw realm reloader: Frontends paused
2017-12-08 06:51:04.452891 7f8202436700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125

I remember seeing this happen on our test cluster a while back with 
Kraken. I can't find the tracker issue I originally found for this, but 
it also sounds like it could be a regression of bug #20339 or #20686?

I recorded some strace output from one of the radosgw instances before 
restarting it, in case that would be useful for opening an issue.

--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com