Re: Luminous rgw hangs after sighup

There have been other issues related to hangs during realm reconfiguration, e.g. http://tracker.ceph.com/issues/20937. We decided to revert the use of SIGHUP to trigger realm reconfiguration in https://github.com/ceph/ceph/pull/16807. I just started a backport of that for luminous.
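
In the meantime, a rough sketch of what that change means in practice, assuming a standard multisite setup (the commands below are illustrative, and the systemd unit name varies by deployment): if I read the change right, SIGHUP goes back to only reopening the daemon's log files, and realm/period changes are applied explicitly instead, for example:

    # commit pending realm/zonegroup/zone changes; running gateways pick up the new period
    radosgw-admin period update --commit

    # or, if a gateway still needs a full reload, restart it (unit name here is just an example)
    systemctl restart ceph-radosgw@rgw.$(hostname -s)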


On 12/11/2017 11:07 AM, Graham Allan wrote:
That's the issue I remember (#20763)!

The hang happened to me once on this cluster, after the upgrade from jewel to 12.2.2; then on Friday I disabled automatic bucket resharding due to some other problems, and I didn't get any logrotate-related hangs through the weekend. I wonder if these could be related?

Graham

On 12/11/2017 02:01 AM, Martin Emrich wrote:
Hi!

This sounds like http://tracker.ceph.com/issues/20763 (or indeed http://tracker.ceph.com/issues/20866).

It is still present in 12.2.2 (just tried it). My workaround is to exclude radosgw from being SIGHUPed by logrotate (remove "radosgw" from /etc/logrotate.d/ceph), to rotate the logs manually from time to time, and to completely restart the radosgw processes one after the other on my radosgw cluster.
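
For illustration, the postrotate hook in /etc/logrotate.d/ceph looks roughly like the sketch below (exact contents differ between versions and packages); with "radosgw" dropped from the process list, logrotate no longer SIGHUPs the gateways:

    /var/log/ceph/*.log {
        rotate 7
        daily
        compress
        sharedscripts
        postrotate
            # "radosgw" removed from this list so the gateways are not SIGHUPed
            killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse || true
        endscript
        missingok
        notifempty
    }

The gateway logs then get rotated by hand: move the old log aside and restart each radosgw process in turn, as described above, so it opens a fresh log file.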

Regards,

Martin

On 08.12.17, 18:58, "ceph-users on behalf of Graham Allan" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of gta@xxxxxxx> wrote:

     I noticed this morning that all four of our rados gateways (luminous
     12.2.2) hung at logrotate time overnight. The last message logged was:

     > 2017-12-08 03:21:01.897363 7fac46176700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125

     One of the 3 nodes recorded more detail:

     > 2017-12-08 06:51:04.452108 7f80fbfdf700  1 rgw realm reloader: Pausing frontends for realm update...
     > 2017-12-08 06:51:04.452126 7f80fbfdf700  1 rgw realm reloader: Frontends paused
     > 2017-12-08 06:51:04.452891 7f8202436700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125

     I remember seeing this happen on our test cluster a while back with
     Kraken. I can't find the tracker issue I originally found related to
     this, but it also sounds like it could be a reversion of bug #20339 or
     #20686?

     I recorded some strace output from one of the radosgw instances before
     restarting, if it's useful to open an issue.

     --
     Graham Allan
     Minnesota Supercomputing Institute - gta@xxxxxxx



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



