There have been other issues related to hangs during realm
reconfiguration, e.g. http://tracker.ceph.com/issues/20937. We decided to
revert the use of SIGHUP to trigger realm reconfiguration in
https://github.com/ceph/ceph/pull/16807. I just started a backport of
that for luminous.
On 12/11/2017 11:07 AM, Graham Allan wrote:
That's the issue I remember (#20763)!
The hang happened to me once on this cluster, after the upgrade from
jewel to 12.2.2; then on Friday I disabled automatic bucket resharding
due to some other problems, and didn't get any logrotate-related hangs
through the weekend. I wonder if these could be related?
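For reference, disabling automatic resharding amounts to turning off the
rgw_dynamic_resharding option for the gateways; a minimal sketch, assuming
it is set in ceph.conf (the instance name below is just a placeholder):

    # ceph.conf on the radosgw hosts -- illustrative only
    [client.rgw.gw1]
        rgw dynamic resharding = false   # disable automatic bucket resharding

and then restarting the gateways so the change takes effect.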
Graham
On 12/11/2017 02:01 AM, Martin Emrich wrote:
Hi!
This sounds like http://tracker.ceph.com/issues/20763 (or indeed
http://tracker.ceph.com/issues/20866).
It is still present in 12.2.2 (just tried it). My workaround is to
exclude radosgw from logrotate (remove "radosgw" from
/etc/logrotate.d/ceph) so it no longer gets SIGHUPed, and instead to
rotate the logs manually from time to time, completely restarting the
radosgw processes one after the other on my radosgw cluster.
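For reference, the stanza shipped in /etc/logrotate.d/ceph looks roughly
like the sketch below; the exact contents vary by release and
distribution, so treat the paths and daemon lists as approximations. The
workaround is simply that radosgw has been dropped from the postrotate
signal lists, so logrotate never sends it a SIGHUP:

    # /etc/logrotate.d/ceph -- approximate stock stanza, not verbatim
    /var/log/ceph/*.log {
        rotate 7
        daily
        compress
        sharedscripts
        postrotate
            # the stock file lists radosgw here as well; it has been removed
            killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse || pkill -1 -x "ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse" || true
        endscript
        missingok
        notifempty
        su root ceph
    }

The radosgw logs can then be rotated by hand now and then: move the log
file aside and restart each gateway in turn (e.g. systemctl restart
ceph-radosgw@<instance> on systemd hosts), which reopens the log without
relying on SIGHUP.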
Regards,
Martin
On 08.12.17, 18:58, "ceph-users on behalf of Graham Allan"
<ceph-users-bounces@xxxxxxxxxxxxxx on behalf of gta@xxxxxxx> wrote:
I noticed this morning that all four of our rados gateways (luminous
12.2.2) hung at logrotate time overnight. The last message logged was:
> 2017-12-08 03:21:01.897363 7fac46176700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
One of the 3 nodes recorded more detail:
> 2017-12-08 06:51:04.452108 7f80fbfdf700 1 rgw realm reloader: Pausing frontends for realm update...
> 2017-12-08 06:51:04.452126 7f80fbfdf700 1 rgw realm reloader: Frontends paused
> 2017-12-08 06:51:04.452891 7f8202436700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
I remember seeing this happen on our test cluster a while back with
Kraken. I can't find the tracker issue I originally found related to
this, but it also sounds like it could be a regression of bug #20339
or #20686?
I recorded some strace output from one of the radosgw instances before
restarting, if it's useful to open an issue.
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com