Re: Received signal: Hangup from killall

After looking through the documentation, SIGHUP-triggered log reopens appear to
be "normal"; however, in the radosgw logs we found:
2023-10-06T01:31:32.920+0200 7fb6f440b700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000002 to be held by another RGW process;
skipping for now
2023-10-06T01:31:33.371+0200 7fb6f440b700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000004 to be held by another RGW process;
skipping for now
2023-10-06T01:31:33.521+0200 7fb6f440b700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000006 to be held by another RGW process;
skipping for now
2023-10-06T01:31:33.853+0200 7fb6f440b700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000008 to be held by another RGW process;
skipping for now
2023-10-06T01:31:34.598+0200 7fb6f440b700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000012 to be held by another RGW process;
skipping for now
2023-10-06T01:31:34.740+0200 7fb6f440b700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000014 to be held by another RGW process;
skipping for now
...
after this line ... it seems that RGW stopped responding.

And the next day it stopped again at almost the same time:
2023-10-07T01:27:26.299+0200 7f6216651700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000005 to be held by another RGW process;
skipping for now
2023-10-07T01:37:28.077+0200 7f6216651700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000014 to be held by another RGW process;
skipping for now
2023-10-07T01:47:27.333+0200 7f6216651700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000001 to be held by another RGW process;
skipping for now
2023-10-07T02:47:29.863+0200 7f6216651700  0 INFO: RGWReshardLock::lock
found lock on reshard.0000000006 to be held by another RGW process;
skipping for now
...
after this line ... RGW stopped responding again. We had to restart it.
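One pattern worth noting: the RGWReshardLock::lock messages recur at roughly the
interval of the RGW reshard worker thread (rgw_reshard_thread_interval, 600 s by
default), which suggests the worker keeps waking up and finding the lock still
held by another process. A quick sanity check on the timestamps from the second
night (a sketch, with the log timestamps pasted in as strings):

```python
from datetime import datetime

# Timestamps of the RGWReshardLock::lock messages from 2023-10-07.
stamps = [
    "2023-10-07T01:27:26.299+0200",
    "2023-10-07T01:37:28.077+0200",
    "2023-10-07T01:47:27.333+0200",
    "2023-10-07T02:47:29.863+0200",
]
times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f%z") for s in stamps]
deltas = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(deltas)
# The first two gaps are ~600 s (the default reshard interval);
# the last gap is ~3600 s, i.e. several cycles went missing before it hung.
```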

We were just about to upgrade to Ceph 17.x, but we have postponed it
because of this.
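For what it's worth, the signal in the postrotate script quoted below is signal
number 1, i.e. SIGHUP, which the Ceph daemons are expected to handle by
reopening their log files, not by shutting down. A quick way to confirm which
signal "-1" actually is:

```shell
# Signal number 1 is SIGHUP; "killall -q -1" / "pkill -1" in the logrotate
# postrotate script send exactly this signal so daemons reopen their logs.
kill -l 1
# For radosgw alone, the equivalent would be:
#   pkill -1 -x radosgw
```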

Rok




On Fri, Oct 6, 2023 at 9:30 AM Rok Jaklič <rjaklic@xxxxxxxxx> wrote:

> Hi,
>
> yesterday we changed RGW from civetweb to beast and at 04:02 RGW stopped
> working; we had to restart it in the morning.
>
> In one rgw log for previous day we can see:
> 2023-10-06T04:02:01.105+0200 7fb71d45d700 -1 received  signal: Hangup from
> killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw
> rbd-mirror cephfs-mirror  (PID: 3202663) UID: 0
> and in the next day log we can see:
> 2023-10-06T04:02:01.133+0200 7fb71d45d700 -1 received  signal: Hangup from
>  (PID: 3202664) UID: 0
>
> and after that no requests came. We had to restart rgw.
>
> In ceph.conf we have something like
>
> [client.radosgw.ctplmon2]
> host = ctplmon2
> log_file = /var/log/ceph/client.radosgw.ctplmon2.log
> rgw_dns_name = ctplmon2
> rgw_frontends = "beast ssl_endpoint=0.0.0.0:4443 ssl_certificate=..."
> rgw_max_put_param_size = 15728640
>
> We assume it has something to do with logrotate.
>
> /etc/logrotate.d/ceph:
> /var/log/ceph/*.log {
>     rotate 90
>     daily
>     compress
>     sharedscripts
>     postrotate
>         killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse
> radosgw rbd-mirror cephfs-mirror || pkill -1 -x
> "ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw|rbd-mirror|cephfs-mirror"
> || true
>     endscript
>     missingok
>     notifempty
>     su root ceph
> }
>
> ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific
> (stable)
>
> Any ideas why this happened?
>
> Kind regards,
> Rok
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



