Re: Luminous rgw hangs after sighup

There have been other issues related to hangs during realm reconfiguration, e.g. http://tracker.ceph.com/issues/20937. We decided to revert the use of SIGHUP to trigger realm reconfiguration in https://github.com/ceph/ceph/pull/16807. I just started a backport of that for luminous.
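
In the meantime, a rough sketch of what that change means in practice, assuming a standard multisite setup (the commands below are illustrative, and the systemd unit name varies by deployment): if I read the change right, SIGHUP goes back to only reopening the daemon's log files, and realm/period changes are applied explicitly instead, for example:

    # commit pending realm/zonegroup/zone changes; running gateways pick up the new period
    radosgw-admin period update --commit

    # or, if a gateway still needs a full reload, restart it (unit name here is just an example)
    systemctl restart ceph-radosgw@rgw.$(hostname -s)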


On 12/11/2017 11:07 AM, Graham Allan wrote:
That's the issue I remember (#20763)!

The hang happened to me once on this cluster, after the upgrade from jewel to 12.2.2; then on Friday I disabled automatic bucket resharding due to some other problems, and I didn't get any logrotate-related hangs through the weekend. I wonder if these could be related?

Graham

On 12/11/2017 02:01 AM, Martin Emrich wrote:
Hi!

This sounds like http://tracker.ceph.com/issues/20763 (or indeed http://tracker.ceph.com/issues/20866).

It is still present in 12.2.2 (just tried it). My workaround is to exclude radosgw from being SIGHUPed by logrotate (remove "radosgw" from /etc/logrotate.d/ceph), to rotate the logs manually from time to time, and to completely restart the radosgw processes one after the other on my radosgw cluster.
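
For illustration, the postrotate hook in /etc/logrotate.d/ceph looks roughly like the sketch below (exact contents differ between versions and packages); with "radosgw" dropped from the process list, logrotate no longer SIGHUPs the gateways:

    /var/log/ceph/*.log {
        rotate 7
        daily
        compress
        sharedscripts
        postrotate
            # "radosgw" removed from this list so the gateways are not SIGHUPed
            killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse || true
        endscript
        missingok
        notifempty
    }

The gateway logs then get rotated by hand: move the old log aside and restart each radosgw process in turn, as described above, so it opens a fresh log file.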

Regards,

Martin

On 08.12.17, 18:58, "ceph-users on behalf of Graham Allan" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of gta@xxxxxxx> wrote:

     I noticed this morning that all four of our rados gateways (luminous
     12.2.2) hung at logrotate time overnight. The last message logged was:

     > 2017-12-08 03:21:01.897363 7fac46176700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125

     One of the 3 nodes recorded more detail:

     > 2017-12-08 06:51:04.452108 7f80fbfdf700  1 rgw realm reloader: Pausing frontends for realm update...
     > 2017-12-08 06:51:04.452126 7f80fbfdf700  1 rgw realm reloader: Frontends paused
     > 2017-12-08 06:51:04.452891 7f8202436700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125

     I remember seeing this happen on our test cluster a while back with
     Kraken. I can't find the tracker issue I originally found related to
     this, but it also sounds like it could be a reversion of bug #20339 or
     #20686?

     I recorded some strace output from one of the radosgw instances before
     restarting, if it's useful to open an issue.

     --
     Graham Allan
     Minnesota Supercomputing Institute - gta@xxxxxxx



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



