Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)

Francois Legrand <fleg@xxxxxxxxxxxxxx> · Tue, 8 Mar 2022 12:23:32 +0100

Hi,

The last 2 OSDs I recreated were on December 30 and February 8.

I totally agree that SSD caches are a terrible SPOF. I think they are an
option if you use 1 SSD/NVMe for 1 or 2 OSDs, but the cost is then very
high. Using 1 SSD for 10 OSDs increases the risk for almost no gain,
because the SSD is 10 times faster but gets 10 times more accesses!
Indeed, we ran some benchmarks with NVMe for the WAL/DB (1 NVMe for ~10
OSDs), and the gain was not tremendous, so we decided not to use them!
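
The trade-off above can be made concrete with a back-of-the-envelope sketch. The figures are illustrative assumptions taken from the wording of this message (an SSD assumed to be 10 times faster than one spinning disk, shared evenly between OSDs), not measurements, and the function names are made up for this example.

```python
def per_osd_speedup(ssd_speedup: float, osds_per_ssd: int) -> float:
    """Relative DB/WAL speed each OSD sees if the SSD is shared evenly."""
    return ssd_speedup / osds_per_ssd


def osds_lost_on_ssd_failure(osds_per_ssd: int) -> int:
    """Every OSD whose DB/WAL lives on the failed SSD is lost with it."""
    return osds_per_ssd


# Assumed speedup of 10x; compare 1, 2 and 10 OSDs per SSD.
for n in (1, 2, 10):
    print(f"{n:2d} OSDs per SSD: "
          f"speedup per OSD ~{per_osd_speedup(10.0, n):.1f}x, "
          f"OSDs lost if the SSD dies: {osds_lost_on_ssd_failure(n)}")
```

Under these assumptions, 10 OSDs per SSD leaves each OSD with roughly the speed of a plain disk, while a single SSD failure takes all 10 OSDs down.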
F.

On 08/03/2022 at 11:57, Boris Behrens wrote:
Hi Francois,

thanks for the reminder. We offline compacted all of the OSDs when we 
reinstalled the hosts with the new OS.
But actually reinstalling them was never on my list.

I could try that, and at the same time remove all the cache SSDs
(when one SSD shares the cache for 10 OSDs, this is a horrible SPOF) and
reuse the SSDs as OSDs for the smaller RGW pools (like log and meta).

How long ago did you recreate the earliest OSD?

Cheers
 Boris

On Tue, 8 Mar 2022 at 10:03, Francois Legrand
<fleg@xxxxxxxxxxxxxx> wrote:

    Hi,
    We also had this kind of problem after upgrading to Octopus. Maybe
    you can play with the heartbeat grace time (
    https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
    ) to tell OSDs to wait a little longer before declaring another OSD
    down!
    We also tried to fix the problem by manually compacting the down OSD
    (something like: systemctl stop ceph-osd@74; sleep 10;
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
    systemctl start ceph-osd@74).
    This worked a few times, but some OSDs went down again, so we now
    simply wait for the data to be rebuilt elsewhere and then reinstall
    the dead OSD:
    ceph osd destroy 74 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sde --destroy
    ceph-volume lvm create --osd-id 74 --data /dev/sde

    This seems to fix the issue for us (up to now).
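
The two procedures above can be written down as a small helper that only builds the command lines and runs nothing. The function names and the dry-run approach are assumptions for illustration; the commands are the ones quoted in this message, with OSD id 74 and /dev/sde as example values. Recreating an OSD destroys its data, so it should only be done once the data has been rebuilt elsewhere, as described above.

```python
from typing import List


def compact_commands(osd_id: int) -> List[str]:
    """Offline compaction of one OSD's key-value store (the OSD is stopped first)."""
    return [
        f"systemctl stop ceph-osd@{osd_id}",
        "sleep 10",
        f"ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-{osd_id} compact",
        f"systemctl start ceph-osd@{osd_id}",
    ]


def recreate_commands(osd_id: int, device: str) -> List[str]:
    """Destroy and recreate an OSD on the same device, reusing its id."""
    return [
        f"ceph osd destroy {osd_id} --yes-i-really-mean-it",
        f"ceph-volume lvm zap {device} --destroy",
        f"ceph-volume lvm create --osd-id {osd_id} --data {device}",
    ]


# Dry run: print the plan for review instead of executing it.
for cmd in compact_commands(74) + recreate_commands(74, "/dev/sde"):
    print(cmd)
```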

    F.

    On 08/03/2022 at 09:35, Boris Behrens wrote:
    > Yes, this is something we know and we disabled it, because we ran
    > into the problem that PGs became unavailable when two or more OSDs
    > went offline.
    >
    > I am searching for the reason WHY this happens.
    > Currently we have set Restart=always in the service file and removed
    > StartLimitBurst from it.
    >
    > We just don't understand why the OSDs don't answer the heartbeat.
    > The OSDs that are flapping are random in terms of host, disk size,
    > and whether or not they have an SSD block.db.
    > Network connectivity issues are something I would rule out, because
    > the cluster went from "nothing ever happens except IOPS" to "random
    > OSDs are marked DOWN until they kill themselves" with the update
    > from Nautilus to Octopus.
    >
    > I am out of ideas and hoped this was a bug in 15.2.15, but after the
    > update things got worse (it happens more often).
    > We tried to:
    > * disable swap
    > * add more swap
    > * disable bluefs_buffered_io
    > * disable write cache for all disks
    > * disable scrubbing
    > * reinstall with new OS (from CentOS 7 to Ubuntu 20.04)
    > * disable cluster_network (so there is only one way to communicate)
    > * increase txqueuelen on the network interfaces
    > * all of the above together
    >
    > What we will try next: add more SATA controllers, so there are not
    > 24 disks attached to a single controller, but I doubt this will help.
    >
    > Cheers
    >   Boris
    >
    >
    >
    > On Tue, 8 Mar 2022 at 09:10, Dan van der Ster
    > <dvanders@xxxxxxxxx> wrote:
    >
    >> Here's the reason they exit:
    >>
    >> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
    >> osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
    >>
    >> If an osd flaps (marked down, then up) 6 times in 10 minutes, it
    >> exits. (This is a safety measure).
    >>
    >> It's normally caused by a network issue -- other OSDs are telling
    >> the mon that it is down, but then the OSD itself tells the mon
    >> that it's up!
    >>
    >> Cheers, Dan
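
The safety rule Dan describes can be sketched as a sliding-window counter. This only illustrates the logic and is not Ceph's actual code: the class and method names are made up here, and the defaults mirror the values in the log line above (osd_max_markdown_count 5, 600 seconds).

```python
from collections import deque


class MarkdownTracker:
    """Counts how often an OSD was marked down within a sliding time window."""

    def __init__(self, max_count: int = 5, period: float = 600.0) -> None:
        self.max_count = max_count
        self.period = period
        self._times = deque()

    def record_markdown(self, now: float) -> bool:
        """Record a markdown at time `now` (in seconds).

        Returns True if the count within the window exceeds max_count,
        i.e. the OSD would shut itself down.
        """
        self._times.append(now)
        # Forget markdowns that are older than the window.
        while self._times and self._times[0] < now - self.period:
            self._times.popleft()
        return len(self._times) > self.max_count


tracker = MarkdownTracker()
# Six markdowns within ten minutes: only the sixth trips the limit.
decisions = [tracker.record_markdown(t) for t in (0, 100, 200, 300, 400, 500)]
print(decisions)
```

With the same six markdowns spread over a longer period (say one every 200 seconds), the count inside the window never exceeds 5 and the OSD keeps running.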
    >>
    >> On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens <bb@xxxxxxxxx> wrote:
    >>> Hi,
    >>>
    >>> we've had the problem with OSDs marked as offline since we updated
    >>> to Octopus and hoped the problem would be fixed with the latest
    >>> patch. We have this kind of problem only with Octopus, and there
    >>> only with the big S3 cluster.
    >>> * Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
    >>> * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
    >>> * We only use the frontend network.
    >>> * All disks are spinning, some have block.db devices.
    >>> * All disks are bluestore
    >>> * configs are mostly defaults
    >>> * we've set the OSDs to Restart=always without a limit, because we
    >>>   had the problem with unavailable PGs when two OSDs are marked as
    >>>   offline and they share PGs.
    >>>
    >>> But since we installed the latest patch we are experiencing more
    >>> OSD downs and even crashes.
    >>> I tried to remove as many duplicate lines as possible.
    >>>
    >>> Is the NUMA error a problem?
    >>> Why do OSD daemons not respond to heartbeats? I mean, even when the
    >>> disk is totally loaded with IO, the system itself should answer
    >>> heartbeats, or am I missing something?
    >>>
    >>> I really hope some of you can point me in the right direction to
    >>> solve this nasty problem.
    >>>
    >>> This is what the latest crash looks like:
    >>> Mar 07 17:44:15 s3db18 ceph-osd[4530]:
    2022-03-07T17:44:15.099+0000
    >>> 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to
    identify
    >> public
    >>> interface '' numa node: (2) No such file or directory
    >>> ...
    >>> Mar 07 17:49:07 s3db18 ceph-osd[4530]:
    2022-03-07T17:49:07.678+0000
    >>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to
    identify
    >> public
    >>> interface '' numa node: (2) No such file or directory
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal
    (Aborted) **
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
    >>> thread_name:tp_osd_tp
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
    >>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
    [0x7f5f0d4623c0]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
    >>> [0x7f5f0d45ef08]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
    >>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
    char const*,
    >>> unsigned long)+0x471) [0x55a699a01201]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
    >>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
    unsigned
    >>> long, unsigned long)+0x8e) [0x55a699a0199e]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
    >>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
    >>> [0x55a699a224b0]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
    >>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
    [0x55a699a252c4]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
    [0x7f5f0d456609]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
    >> [0x7f5f0cfc0163]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:
    2022-03-07T17:53:07.387+0000
    >>> 7f5ef1501700 -1 *** Caught signal (Aborted) **
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
    >>> thread_name:tp_osd_tp
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
    >>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
    [0x7f5f0d4623c0]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
    >>> [0x7f5f0d45ef08]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
    >>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
    char const*,
    >>> unsigned long)+0x471) [0x55a699a01201]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
    >>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
    unsigned
    >>> long, unsigned long)+0x8e) [0x55a699a0199e]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
    >>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
    >>> [0x55a699a224b0]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
    >>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
    [0x55a699a252c4]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
    [0x7f5f0d456609]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
    >> [0x7f5f0cfc0163]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the
    executable,
    >> or
    >>> `objdump -rdS <executable>` is needed to interpret this.
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246>
    >> 2022-03-07T17:49:07.678+0000
    >>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to
    identify
    >> public
    >>> interface '' numa node: (2) No such file or directory
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:      0>
    >> 2022-03-07T17:53:07.387+0000
    >>> 7f5ef1501700 -1 *** Caught signal (Aborted) **
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
    >>> thread_name:tp_osd_tp
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
    >>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
    [0x7f5f0d4623c0]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
    >>> [0x7f5f0d45ef08]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
    >>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
    char const*,
    >>> unsigned long)+0x471) [0x55a699a01201]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
    >>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
    unsigned
    >>> long, unsigned long)+0x8e) [0x55a699a0199e]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
    >>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
    >>> [0x55a699a224b0]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
    >>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
    [0x55a699a252c4]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
    [0x7f5f0d456609]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
    >> [0x7f5f0cfc0163]
    >>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the
    executable,
    >> or
    >>> `objdump -rdS <executable>` is needed to interpret this.
    >>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Main
    process
    >>> exited, code=killed, status=6/ABRT
    >>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service:
    Failed with
    >> result
    >>> 'signal'.
    >>> Mar 07 17:53:19 s3db18 systemd[1]: ceph-osd@161.service: Scheduled
    >> restart
    >>> job, restart counter is at 1.
    >>> Mar 07 17:53:19 s3db18 systemd[1]: Stopped Ceph object storage
    daemon
    >>> osd.161.
    >>> Mar 07 17:53:19 s3db18 systemd[1]: Starting Ceph object
    storage daemon
    >>> osd.161...
    >>> Mar 07 17:53:19 s3db18 systemd[1]: Started Ceph object storage
    daemon
    >>> osd.161.
    >>> Mar 07 17:53:20 s3db18 ceph-osd[4009440]:
    2022-03-07T17:53:20.498+0000
    >>> 7f9617781d80 -1 Falling back to public interface
    >>> Mar 07 17:53:33 s3db18 ceph-osd[4009440]:
    2022-03-07T17:53:33.906+0000
    >>> 7f9617781d80 -1 osd.161 489778 log_to_monitors {default=true}
    >>> Mar 07 17:53:34 s3db18 ceph-osd[4009440]:
    2022-03-07T17:53:34.206+0000
    >>> 7f96106f2700 -1 osd.161 489778 set_numa_affinity unable to
    identify
    >> public
    >>> interface '' numa node: (2) No such file or directory
    >>> ...
    >>> Mar 07 18:58:12 s3db18 ceph-osd[4009440]:
    2022-03-07T18:58:12.717+0000
    >>> 7f96106f2700 -1 osd.161 489880 set_numa_affinity unable to
    identify
    >> public
    >>> interface '' numa node: (2) No such file or directory
    >>>
    >>> And this is what it looks like when OSDs get marked as out:
    >>> Mar 03 19:29:04 s3db13 ceph-osd[5792]:
    2022-03-03T19:29:04.857+0000
    >>> 7f16115e0700 -1 osd.97 485814 heartbeat_check: no reply from
    >>> [XX:22::65]:6886 osd.124 since back
    2022-03-03T19:28:41.250692+0000 front
    >>> 2022-03-03T19:28:41.250649+0000 (oldest deadline
    >>> 2022-03-03T19:29:04.150352+0000)
    >>> ... (130 times) ...
    >>> Mar 03 21:55:37 s3db13 ceph-osd[5792]:
    2022-03-03T21:55:37.844+0000
    >>> 7f16115e0700 -1 osd.97 486383 heartbeat_check: no reply from
    >>> [XX:22::65]:6941 osd.124 since back
    2022-03-03T21:55:12.514627+0000 front
    >>> 2022-03-03T21:55:12.514649+0000 (oldest deadline
    >>> 2022-03-03T21:55:36.613469+0000)
    >>> Mar 04 00:00:05 s3db13 ceph-osd[5792]:
    2022-03-04T00:00:05.035+0000
    >>> 7f1613080700 -1 received  signal: Hangup from killall -q -1
    ceph-mon
    >>> ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror  (PID:
    1385079)
    >>> UID: 0
    >>> Mar 04 00:00:05 s3db13 ceph-osd[5792]:
    2022-03-04T00:00:05.047+0000
    >>> 7f1613080700 -1 received  signal: Hangup from (PID: 1385080)
    UID: 0
    >>> Mar 04 00:06:00 s3db13 sudo[1389262]:     ceph : TTY=unknown ;
    PWD=/ ;
    >>> USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
    >>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session):
    session
    >>> opened for user root by (uid=0)
    >>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session):
    session
    >>> closed for user root
    >>> Mar 04 00:06:01 s3db13 sudo[1389287]:     ceph : TTY=unknown ;
    PWD=/ ;
    >>> USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json
    /dev/sde
    >>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session):
    session
    >>> opened for user root by (uid=0)
    >>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session):
    session
    >>> closed for user root
    >>> Mar 05 00:00:10 s3db13 ceph-osd[5792]:
    2022-03-05T00:00:10.213+0000
    >>> 7f1613080700 -1 received  signal: Hangup from killall -q -1
    ceph-mon
    >>> ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror  (PID:
    2406262)
    >>> UID: 0
    >>> Mar 05 00:00:10 s3db13 ceph-osd[5792]:
    2022-03-05T00:00:10.237+0000
    >>> 7f1613080700 -1 received  signal: Hangup from (PID: 2406263)
    UID: 0
    >>> Mar 05 00:08:03 s3db13 sudo[2411721]:     ceph : TTY=unknown ;
    PWD=/ ;
    >>> USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
    >>> Mar 05 00:08:03 s3db13 sudo[2411721]: pam_unix(sudo:session):
    session
    >>> opened for user root by (uid=0)
    >>> Mar 05 00:08:04 s3db13 sudo[2411721]: pam_unix(sudo:session):
    session
    >>> closed for user root
    >>> Mar 05 00:08:04 s3db13 sudo[2411725]:     ceph : TTY=unknown ;
    PWD=/ ;
    >>> USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json
    /dev/sde
    >>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session):
    session
    >>> opened for user root by (uid=0)
    >>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session):
    session
    >>> closed for user root
    >>> Mar 05 19:19:49 s3db13 ceph-osd[5792]:
    2022-03-05T19:19:49.189+0000
    >>> 7f160fddd700 -1 osd.97 486852 set_numa_affinity unable to
    identify public
    >>> interface '' numa node: (2) No such file or directory
    >>> Mar 05 19:21:18 s3db13 ceph-osd[5792]:
    2022-03-05T19:21:18.377+0000
    >>> 7f160fddd700 -1 osd.97 486858 set_numa_affinity unable to
    identify public
    >>> interface '' numa node: (2) No such file or directory
    >>> Mar 05 19:21:45 s3db13 ceph-osd[5792]:
    2022-03-05T19:21:45.304+0000
    >>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:21:21.762744+0000 front
    >>> 2022-03-05T19:21:21.762723+0000 (oldest deadline
    >>> 2022-03-05T19:21:45.261347+0000)
    >>> Mar 05 19:21:46 s3db13 ceph-osd[5792]:
    2022-03-05T19:21:46.260+0000
    >>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:21:21.762744+0000 front
    >>> 2022-03-05T19:21:21.762723+0000 (oldest deadline
    >>> 2022-03-05T19:21:45.261347+0000)
    >>> Mar 05 19:21:47 s3db13 ceph-osd[5792]:
    2022-03-05T19:21:47.252+0000
    >>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:21:21.762744+0000 front
    >>> 2022-03-05T19:21:21.762723+0000 (oldest deadline
    >>> 2022-03-05T19:21:45.261347+0000)
    >>> Mar 05 19:22:59 s3db13 ceph-osd[5792]:
    2022-03-05T19:22:59.636+0000
    >>> 7f160fddd700 -1 osd.97 486869 set_numa_affinity unable to
    identify public
    >>> interface '' numa node: (2) No such file or directory
    >>> Mar 05 19:23:33 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:33.439+0000
    >>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:23:34 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:34.458+0000
    >>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:23:35 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:35.434+0000
    >>> 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:23:09.928097+0000 front
    >>> 2022-03-05T19:23:09.928150+0000 (oldest deadline
    >>> 2022-03-05T19:23:35.227545+0000)
    >>> ...
    >>> Mar 05 19:23:48 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:48.386+0000
    >>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:23:49 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:49.362+0000
    >>> 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:23:09.928097+0000 front
    >>> 2022-03-05T19:23:09.928150+0000 (oldest deadline
    >>> 2022-03-05T19:23:35.227545+0000)
    >>> Mar 05 19:23:49 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:49.362+0000
    >>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:23:50 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:50.358+0000
    >>> 7f16115e0700 -1 osd.97 486873 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:23:51 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:51.330+0000
    >>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    >>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:23:52 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:52.326+0000
    >>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    >>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:23:53 s3db13 ceph-osd[5792]:
    2022-03-05T19:23:53.338+0000
    >>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    >>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
    >>> ondisk+retry+read+known_if_redirected e486872)
    >>> Mar 05 19:25:02 s3db13 ceph-osd[5792]:
    2022-03-05T19:25:02.342+0000
    >>> 7f160fddd700 -1 osd.97 486878 set_numa_affinity unable to
    identify public
    >>> interface '' numa node: (2) No such file or directory
    >>> Mar 05 19:25:33 s3db13 ceph-osd[5792]:
    2022-03-05T19:25:33.569+0000
    >>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 2
    slow ops,
    >>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486879)
    >>> ...
    >>> Mar 05 19:25:44 s3db13 ceph-osd[5792]:
    2022-03-05T19:25:44.476+0000
    >>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    >>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486879)
    >>> Mar 05 19:25:45 s3db13 ceph-osd[5792]:
    2022-03-05T19:25:45.456+0000
    >>> 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    >>> 2022-03-05T19:25:25.281582+0000 (oldest deadline
    >>> 2022-03-05T19:25:45.281582+0000)
    >>> Mar 05 19:25:45 s3db13 ceph-osd[5792]:
    2022-03-05T19:25:45.456+0000
    >>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    >>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486879)
    >>> ...
    >>> Mar 05 19:26:08 s3db13 ceph-osd[5792]:
    2022-03-05T19:26:08.363+0000
    >>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    >>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486879)
    >>> Mar 05 19:26:09 s3db13 ceph-osd[5792]:
    2022-03-05T19:26:09.371+0000
    >>> 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    >>> 2022-03-05T19:25:25.281582+0000 (oldest deadline
    >>> 2022-03-05T19:25:45.281582+0000)
    >>> Mar 05 19:26:09 s3db13 ceph-osd[5792]:
    2022-03-05T19:26:09.375+0000
    >>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    >>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486879)
    >>> Mar 05 19:26:10 s3db13 ceph-osd[5792]:
    2022-03-05T19:26:10.383+0000
    >>> 7f16115e0700 -1 osd.97 486881 get_health_metrics reporting 3
    slow ops,
    >>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486879)
    >>> Mar 05 19:26:11 s3db13 ceph-osd[5792]:
    2022-03-05T19:26:11.407+0000
    >>> 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    >>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11
    >>> ondisk+retry+read+known_if_redirected e486879)
    >>> Mar 05 19:26:12 s3db13 ceph-osd[5792]:
    2022-03-05T19:26:12.399+0000
    >>> 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1
    slow ops,
    >>> oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    >>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11
    >>> ondisk+retry+read+known_if_redirected e486879)
    >>> Mar 05 19:27:24 s3db13 ceph-osd[5792]:
    2022-03-05T19:27:24.975+0000
    >>> 7f160fddd700 -1 osd.97 486887 set_numa_affinity unable to
    identify public
    >>> interface '' numa node: (2) No such file or directory
    >>> Mar 05 19:27:58 s3db13 ceph-osd[5792]:
    2022-03-05T19:27:58.114+0000
    >>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    >>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486889)
    >>> ...
    >>> Mar 05 19:28:08 s3db13 ceph-osd[5792]:
    2022-03-05T19:28:08.137+0000
    >>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    >>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486889)
    >>> Mar 05 19:28:09 s3db13 ceph-osd[5792]:
    2022-03-05T19:28:09.125+0000
    >>> 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    >>> 2022-03-05T19:27:48.548094+0000 (oldest deadline
    >>> 2022-03-05T19:28:08.548094+0000)
    >>> Mar 05 19:28:09 s3db13 ceph-osd[5792]:
    2022-03-05T19:28:09.125+0000
    >>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    >>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486889)
    >>> ...
    >>> Mar 05 19:28:29 s3db13 ceph-osd[5792]:
    2022-03-05T19:28:29.060+0000
    >>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    >>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    >>> ondisk+retry+write+known_if_redirected e486889)
    >>> Mar 05 19:28:30 s3db13 ceph-osd[5792]:
    2022-03-05T19:28:30.040+0000
    >>> 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from
    >>> [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    >>> 2022-03-05T19:27:48.548094+0000 (oldest deadline
    >>> 2022-03-05T19:28:08.548094+0000)
    >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]:
    2022-03-05T19:29:43.696+0000
    >>> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
    >>> osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
    >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]:
    2022-03-05T19:29:43.700+0000
    >>> 7f1613080700 -1 received  signal: Interrupt from Kernel ( Could be
    >>> generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
    >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]:
    2022-03-05T19:29:43.700+0000
    >>> 7f1613080700 -1 osd.97 486896 *** Got signal Interrupt ***
    >>> Mar 05 19:29:43 s3db13 ceph-osd[5792]:
    2022-03-05T19:29:43.700+0000
    >>> 7f1613080700 -1 osd.97 486896 *** Immediate shutdown
    >>> (osd_fast_shutdown=true) ***
    >>> Mar 05 19:29:44 s3db13 systemd[1]: ceph-osd@97.service: Succeeded.
    >>> Mar 05 19:29:54 s3db13 systemd[1]: ceph-osd@97.service: Scheduled
    >> restart
    >>> job, restart counter is at 1.
    >>> Mar 05 19:29:54 s3db13 systemd[1]: Stopped Ceph object storage
    daemon
    >>> osd.97.
    >>> Mar 05 19:29:54 s3db13 systemd[1]: Starting Ceph object
    storage daemon
    >>> osd.97...
    >>> Mar 05 19:29:54 s3db13 systemd[1]: Started Ceph object storage
    daemon
    >>> osd.97.
    >>> Mar 05 19:29:55 s3db13 ceph-osd[3236773]:
    2022-03-05T19:29:55.116+0000
    >>> 7f5852f74d80 -1 Falling back to public interface
    >>> Mar 05 19:30:34 s3db13 ceph-osd[3236773]:
    2022-03-05T19:30:34.746+0000
    >>> 7f5852f74d80 -1 osd.97 486896 log_to_monitors {default=true}
    >>> --
    >>> The "UTF-8 problems" self-help group will, as an exception, meet
    >>> in the large hall this time.
    >>> _______________________________________________
    >>> ceph-users mailing list -- ceph-users@xxxxxxx
    >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
    >
