Hi Francois, thanks for the reminder. We offline compacted all of the OSDs
when we reinstalled the hosts with the new OS. But actually recreating the
OSDs was never on my list. I could try that, and in the same go I can remove
all the cache SSDs (one SSD sharing the cache for 10 OSDs is a horrible
SPOF) and reuse the SSDs as OSDs for the smaller RGW pools (like log and
meta).

How long ago did you recreate the earliest OSD?

Cheers
 Boris

On Tue, Mar 8, 2022 at 10:03 AM Francois Legrand <fleg@xxxxxxxxxxxxxx> wrote:

> Hi,
> We also had this kind of problem after upgrading to octopus. Maybe you
> can play with the heartbeat grace time (
> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
> ) to tell osds to wait a little longer before declaring another osd down!
> We also tried to fix the problem by manually compacting the down osd
> (something like: systemctl stop ceph-osd@74; sleep 10;
> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
> systemctl start ceph-osd@74).
> This worked a few times, but some osds went down again, so we simply
> waited for the data to be reconstructed elsewhere and then rebuilt the
> dead osd:
> ceph osd destroy 74 --yes-i-really-mean-it
> ceph-volume lvm zap /dev/sde --destroy
> ceph-volume lvm create --osd-id 74 --data /dev/sde
>
> This seems to have fixed the issue for us (so far).
>
> F.
>
> On 08/03/2022 at 09:35, Boris Behrens wrote:
> > Yes, this is something we know and we disabled it, because we ran into
> > the problem that PGs went unavailable when two or more OSDs went offline.
> >
> > I am searching for the reason WHY this happens.
> > Currently we have set the service file to restart=always and removed the
> > StartLimitBurst from the service file.
> >
> > We just don't understand why the OSDs don't answer the heartbeat. The
> > OSDs that are flapping are random in terms of host, disk size, and
> > having an SSD block.db or not.
> > Network connectivity issues are something I would rule out, because the
> > cluster went from "nothing ever happens except IOPS" to "random OSDs are
> > marked DOWN until they kill themselves" with the update from nautilus to
> > octopus.
> >
> > I am out of ideas and hoped this was a bug in 15.2.15, but after the
> > update things got worse (it happens more often).
> > We tried to:
> > * disable swap
> > * add more swap
> > * disable bluefs_buffered_io
> > * disable the write cache for all disks
> > * disable scrubbing
> > * reinstall with a new OS (from CentOS 7 to Ubuntu 20.04)
> > * disable the cluster_network (so there is only one way to communicate)
> > * increase txqueuelen on the network interfaces
> > * everything together
> >
> > What we will try next: add more SATA controllers, so there are not 24
> > disks attached to a single controller, but I doubt this will help.
> >
> > Cheers
> >  Boris
> >
> > On Tue, Mar 8, 2022 at 9:10 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> >
> >> Here's the reason they exit:
> >>
> >> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
> >>
> >> If an osd flaps (marked down, then up) 6 times in 10 minutes, it
> >> exits. (This is a safety measure.)
> >>
> >> It's normally caused by a network issue -- other OSDs are telling the
> >> mon that he is down, but then the OSD himself tells the mon that he's
> >> up!
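> >>
> >> If you need more headroom while you debug, the thresholds behind that
> >> message are plain config options (a sketch, untested -- raising them
> >> only hides the flapping, it doesn't fix the cause; values are examples):
> >>
> >>   ceph config get osd osd_max_markdown_count    # default: 5
> >>   ceph config get osd osd_max_markdown_period   # default: 600 seconds
> >>   ceph config set osd osd_max_markdown_count 10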
> >>
> >> Cheers, Dan
> >>
> >> On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens <bb@xxxxxxxxx> wrote:
> >>> Hi,
> >>>
> >>> we've had the problem with OSDs marked as offline since we updated to
> >>> octopus and hoped the problem would be fixed with the latest patch. We
> >>> have this kind of problem only with octopus, and there only with the
> >>> big s3 cluster.
> >>> * Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
> >>> * Network interfaces are 20gbit (2x10 in an 802.3ad encap3+4 bond)
> >>> * We only use the frontend network.
> >>> * All disks are spinning, some have block.db devices.
> >>> * All disks are bluestore
> >>> * configs are mostly defaults
> >>> * we've set the OSDs to restart=always without a limit, because we had
> >>> the problem with unavailable PGs when two OSDs are marked as offline
> >>> and they share PGs.
> >>>
> >>> But since we installed the latest patch we have been experiencing more
> >>> OSD downs and even crashes.
> >>> I tried to remove as many duplicated lines as possible.
> >>>
> >>> Is the numa error a problem?
> >>> Why do OSD daemons not respond to heartbeats? I mean, even when the
> >>> disk is totally loaded with IO, the system itself should answer
> >>> heartbeats, or am I missing something?
> >>>
> >>> I really hope some of you can set me on the right path to solve this
> >>> nasty problem.
> >>>
> >>> This is what the latest crash looks like:
> >>> Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+0000 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> ...
> >>> Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+0000 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 thread_name:tp_osd_tp
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) [0x7f5f0d4623c0]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) [0x7f5f0d45ef08]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x471) [0x55a699a01201]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x8e) [0x55a699a0199e]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0) [0x55a699a224b0]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 6: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 7: (()+0x8609) [0x7f5f0d456609]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 8: (clone()+0x43) [0x7f5f0cfc0163]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+0000 7f5ef1501700 -1 *** Caught signal (Aborted) **
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 thread_name:tp_osd_tp
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) [0x7f5f0d4623c0]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) [0x7f5f0d45ef08]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x471) [0x55a699a01201]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x8e) [0x55a699a0199e]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0) [0x55a699a224b0]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 6: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 7: (()+0x8609) [0x7f5f0d456609]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 8: (clone()+0x43) [0x7f5f0cfc0163]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: -5246> 2022-03-07T17:49:07.678+0000 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 0> 2022-03-07T17:53:07.387+0000 7f5ef1501700 -1 *** Caught signal (Aborted) **
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 thread_name:tp_osd_tp
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) [0x7f5f0d4623c0]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) [0x7f5f0d45ef08]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x471) [0x55a699a01201]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x8e) [0x55a699a0199e]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0) [0x55a699a224b0]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 6: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 7: (()+0x8609) [0x7f5f0d456609]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 8: (clone()+0x43) [0x7f5f0cfc0163]
> >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
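> >>> [Side note: an abort like this is also recorded by the crash module,
> >>> so the full report can be pulled afterwards -- a sketch; the crash id
> >>> is a placeholder:
> >>>   ceph crash ls
> >>>   ceph crash info <crash-id>
> >>> ]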
> >>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Main process exited, code=killed, status=6/ABRT
> >>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Failed with result 'signal'.
> >>> Mar 07 17:53:19 s3db18 systemd[1]: ceph-osd@161.service: Scheduled restart job, restart counter is at 1.
> >>> Mar 07 17:53:19 s3db18 systemd[1]: Stopped Ceph object storage daemon osd.161.
> >>> Mar 07 17:53:19 s3db18 systemd[1]: Starting Ceph object storage daemon osd.161...
> >>> Mar 07 17:53:19 s3db18 systemd[1]: Started Ceph object storage daemon osd.161.
> >>> Mar 07 17:53:20 s3db18 ceph-osd[4009440]: 2022-03-07T17:53:20.498+0000 7f9617781d80 -1 Falling back to public interface
> >>> Mar 07 17:53:33 s3db18 ceph-osd[4009440]: 2022-03-07T17:53:33.906+0000 7f9617781d80 -1 osd.161 489778 log_to_monitors {default=true}
> >>> Mar 07 17:53:34 s3db18 ceph-osd[4009440]: 2022-03-07T17:53:34.206+0000 7f96106f2700 -1 osd.161 489778 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> ...
> >>> Mar 07 18:58:12 s3db18 ceph-osd[4009440]: 2022-03-07T18:58:12.717+0000 7f96106f2700 -1 osd.161 489880 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>>
> >>> And this is what it looks like when OSDs get marked as out:
> >>> Mar 03 19:29:04 s3db13 ceph-osd[5792]: 2022-03-03T19:29:04.857+0000 7f16115e0700 -1 osd.97 485814 heartbeat_check: no reply from [XX:22::65]:6886 osd.124 since back 2022-03-03T19:28:41.250692+0000 front 2022-03-03T19:28:41.250649+0000 (oldest deadline 2022-03-03T19:29:04.150352+0000)
> >>> ... (130 times) ...
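> >>> [For lines like these, the per-peer heartbeat ping times can be
> >>> dumped from the OSD's admin socket on the affected host, to see
> >>> whether replies are slow or missing entirely -- a sketch, untested;
> >>> the osd id and the threshold in milliseconds are examples:
> >>>   ceph daemon osd.97 dump_osd_network 500
> >>> ]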
> >>> Mar 03 21:55:37 s3db13 ceph-osd[5792]: 2022-03-03T21:55:37.844+0000 7f16115e0700 -1 osd.97 486383 heartbeat_check: no reply from [XX:22::65]:6941 osd.124 since back 2022-03-03T21:55:12.514627+0000 front 2022-03-03T21:55:12.514649+0000 (oldest deadline 2022-03-03T21:55:36.613469+0000)
> >>> Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.035+0000 7f1613080700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror (PID: 1385079) UID: 0
> >>> Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.047+0000 7f1613080700 -1 received signal: Hangup from (PID: 1385080) UID: 0
> >>> Mar 04 00:06:00 s3db13 sudo[1389262]: ceph : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
> >>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session): session opened for user root by (uid=0)
> >>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session): session closed for user root
> >>> Mar 04 00:06:01 s3db13 sudo[1389287]: ceph : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json /dev/sde
> >>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session): session opened for user root by (uid=0)
> >>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session): session closed for user root
> >>> Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.213+0000 7f1613080700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror (PID: 2406262) UID: 0
> >>> Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.237+0000 7f1613080700 -1 received signal: Hangup from (PID: 2406263) UID: 0
> >>> Mar 05 00:08:03 s3db13 sudo[2411721]: ceph : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
> >>> Mar 05 00:08:03 s3db13 sudo[2411721]: pam_unix(sudo:session): session opened for user root by (uid=0)
> >>> Mar 05 00:08:04 s3db13 sudo[2411721]: pam_unix(sudo:session): session closed for user root
> >>> Mar 05 00:08:04 s3db13 sudo[2411725]: ceph : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json /dev/sde
> >>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session): session opened for user root by (uid=0)
> >>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session): session closed for user root
> >>> Mar 05 19:19:49 s3db13 ceph-osd[5792]: 2022-03-05T19:19:49.189+0000 7f160fddd700 -1 osd.97 486852 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> Mar 05 19:21:18 s3db13 ceph-osd[5792]: 2022-03-05T19:21:18.377+0000 7f160fddd700 -1 osd.97 486858 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> Mar 05 19:21:45 s3db13 ceph-osd[5792]: 2022-03-05T19:21:45.304+0000 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 front 2022-03-05T19:21:21.762723+0000 (oldest deadline 2022-03-05T19:21:45.261347+0000)
> >>> Mar 05 19:21:46 s3db13 ceph-osd[5792]: 2022-03-05T19:21:46.260+0000 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 front 2022-03-05T19:21:21.762723+0000 (oldest deadline 2022-03-05T19:21:45.261347+0000)
> >>> Mar 05 19:21:47 s3db13 ceph-osd[5792]: 2022-03-05T19:21:47.252+0000 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 front 2022-03-05T19:21:21.762723+0000 (oldest deadline 2022-03-05T19:21:45.261347+0000)
> >>> Mar 05 19:22:59 s3db13 ceph-osd[5792]: 2022-03-05T19:22:59.636+0000 7f160fddd700 -1 osd.97 486869 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> Mar 05 19:23:33 s3db13 ceph-osd[5792]: 2022-03-05T19:23:33.439+0000 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded) ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:23:34 s3db13 ceph-osd[5792]: 2022-03-05T19:23:34.458+0000 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded) ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:23:35 s3db13 ceph-osd[5792]: 2022-03-05T19:23:35.434+0000 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 since back 2022-03-05T19:23:09.928097+0000 front 2022-03-05T19:23:09.928150+0000 (oldest deadline 2022-03-05T19:23:35.227545+0000)
> >>> ...
> >>> Mar 05 19:23:48 s3db13 ceph-osd[5792]: 2022-03-05T19:23:48.386+0000 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded) ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 since back 2022-03-05T19:23:09.928097+0000 front 2022-03-05T19:23:09.928150+0000 (oldest deadline 2022-03-05T19:23:35.227545+0000)
> >>> Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded) ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:23:50 s3db13 ceph-osd[5792]: 2022-03-05T19:23:50.358+0000 7f16115e0700 -1 osd.97 486873 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded) ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:23:51 s3db13 ceph-osd[5792]: 2022-03-05T19:23:51.330+0000 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9 ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:23:52 s3db13 ceph-osd[5792]: 2022-03-05T19:23:52.326+0000 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9 ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:23:53 s3db13 ceph-osd[5792]: 2022-03-05T19:23:53.338+0000 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9 ondisk+retry+read+known_if_redirected e486872)
> >>> Mar 05 19:25:02 s3db13 ceph-osd[5792]: 2022-03-05T19:25:02.342+0000 7f160fddd700 -1 osd.97 486878 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> Mar 05 19:25:33 s3db13 ceph-osd[5792]: 2022-03-05T19:25:33.569+0000 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486879)
> >>> ...
> >>> Mar 05 19:25:44 s3db13 ceph-osd[5792]: 2022-03-05T19:25:44.476+0000 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486879)
> >>> Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent 2022-03-05T19:25:25.281582+0000 (oldest deadline 2022-03-05T19:25:45.281582+0000)
> >>> Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486879)
> >>> ...
> >>> Mar 05 19:26:08 s3db13 ceph-osd[5792]: 2022-03-05T19:26:08.363+0000 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486879)
> >>> Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.371+0000 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent 2022-03-05T19:25:25.281582+0000 (oldest deadline 2022-03-05T19:25:45.281582+0000)
> >>> Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.375+0000 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486879)
> >>> Mar 05 19:26:10 s3db13 ceph-osd[5792]: 2022-03-05T19:26:10.383+0000 7f16115e0700 -1 osd.97 486881 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486879)
> >>> Mar 05 19:26:11 s3db13 ceph-osd[5792]: 2022-03-05T19:26:11.407+0000 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11 ondisk+retry+read+known_if_redirected e486879)
> >>> Mar 05 19:26:12 s3db13 ceph-osd[5792]: 2022-03-05T19:26:12.399+0000 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11 ondisk+retry+read+known_if_redirected e486879)
> >>> Mar 05 19:27:24 s3db13 ceph-osd[5792]: 2022-03-05T19:27:24.975+0000 7f160fddd700 -1 osd.97 486887 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
> >>> Mar 05 19:27:58 s3db13 ceph-osd[5792]: 2022-03-05T19:27:58.114+0000 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486889)
> >>> ...
> >>> Mar 05 19:28:08 s3db13 ceph-osd[5792]: 2022-03-05T19:28:08.137+0000 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486889)
> >>> Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent 2022-03-05T19:27:48.548094+0000 (oldest deadline 2022-03-05T19:28:08.548094+0000)
> >>> Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486889)
> >>> ...
> >>> Mar 05 19:28:29 s3db13 ceph-osd[5792]: 2022-03-05T19:28:29.060+0000 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded) ondisk+retry+write+known_if_redirected e486889)
> >>> Mar 05 19:28:30 s3db13 ceph-osd[5792]: 2022-03-05T19:28:30.040+0000 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent 2022-03-05T19:27:48.548094+0000 (oldest deadline 2022-03-05T19:28:08.548094+0000)
> >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.696+0000 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
> >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000 7f1613080700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
> >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000 7f1613080700 -1 osd.97 486896 *** Got signal Interrupt ***
> >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000 7f1613080700 -1 osd.97 486896 *** Immediate shutdown (osd_fast_shutdown=true) ***
> >>> Mar 05 19:29:44 s3db13 systemd[1]: ceph-osd@97.service: Succeeded.
> >>> Mar 05 19:29:54 s3db13 systemd[1]: ceph-osd@97.service: Scheduled restart job, restart counter is at 1.
> >>> Mar 05 19:29:54 s3db13 systemd[1]: Stopped Ceph object storage daemon osd.97.
> >>> Mar 05 19:29:54 s3db13 systemd[1]: Starting Ceph object storage daemon osd.97...
> >>> Mar 05 19:29:54 s3db13 systemd[1]: Started Ceph object storage daemon osd.97.
> >>> Mar 05 19:29:55 s3db13 ceph-osd[3236773]: 2022-03-05T19:29:55.116+0000 7f5852f74d80 -1 Falling back to public interface
> >>> Mar 05 19:30:34 s3db13 ceph-osd[3236773]: 2022-03-05T19:30:34.746+0000 7f5852f74d80 -1 osd.97 486896 log_to_monitors {default=true}
> >>> --
> >>> This time, as an exception, the "UTF-8-Probleme" self-help group meets in the big hall.
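PS: regarding the recurring set_numa_affinity warning above -- if it turns
out to be harmless noise, automatic NUMA pinning can be switched off so it
stops spamming the log (a sketch, untested on our side; this silences the
symptom, nothing more):

  ceph config set osd osd_numa_auto_affinity false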
--
This time, as an exception, the "UTF-8-Probleme" self-help group meets in the big hall.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx