Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)

Unfortunately there is no silver bullet here so far. Just one note after looking at your configuration: I would strongly encourage you to add SSD DB devices to the spinner-only OSDs.

Particularly since they are used for s3 payload, which is pretty DB-intensive.
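
If it helps, here is a rough sketch of how a DB volume can be added to an existing BlueStore OSD with ceph-bluestore-tool; the OSD id and the target device below are placeholders, and you should double-check the documentation for your exact release before running this:

  systemctl stop ceph-osd@161
  # attach a new (SSD/NVMe) DB device to the OSD
  ceph-bluestore-tool bluefs-bdev-new-db \
      --path /var/lib/ceph/osd/ceph-161 --dev-target /dev/nvme0n1p1
  # optionally migrate the DB data that already lives on the slow device
  ceph-bluestore-tool bluefs-bdev-migrate \
      --path /var/lib/ceph/osd/ceph-161 \
      --devs-source /var/lib/ceph/osd/ceph-161/block \
      --dev-target /var/lib/ceph/osd/ceph-161/block.db
  systemctl start ceph-osd@161

Doing this one OSD (or one host) at a time keeps the rebalancing and downtime impact small.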


Thanks,

Igor

On 3/23/2022 5:03 PM, Boris Behrens wrote:
Hi Igor,
yes, I've compacted them all.

So is there a solution to the problem? I can imagine this happens when we remove large files from s3 (we use it as backup storage for lz4-compressed rbd exports).
Maybe I missed it.

Cheers
 Boris

On Wed, Mar 23, 2022 at 13:43, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

    Hi Boris,

    Curious whether you have tried to compact RocksDB for all your OSDs?
    Sorry if this has already been discussed - I haven't read through the
    whole thread...
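
    In case it is useful, compaction can be triggered either online via the
    ceph CLI or offline with ceph-kvstore-tool while the OSD is stopped;
    osd.161 and the path below are just examples:

        ceph tell osd.161 compact
        # or, with the OSD stopped:
        ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-161 compact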

    From my experience the symptoms you're facing are pretty common for DB
    performance degradation caused by bulk data removal. In that case OSDs
    start to flap due to the suicide timeout, as some regular user ops take
    ages to complete.
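
    One way to see this happening (just a sketch, adjust the OSD id) is to
    look at the ops that are stuck on a flapping OSD and at the timeout that
    eventually kills the worker thread:

        ceph daemon osd.97 dump_ops_in_flight
        ceph daemon osd.97 dump_historic_slow_ops
        ceph config get osd osd_op_thread_suicide_timeout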

    The issue has been discussed in this list multiple times.

    Thanks,

    Igor

    On 3/8/2022 12:36 AM, Boris Behrens wrote:
    > Hi,
    >
    > we've had the problem of OSDs being marked as offline since we updated
    > to octopus, and we hoped the problem would be fixed with the latest
    > patch. We only see this kind of problem with octopus, and there only
    > with the big s3 cluster.
    > * Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
    > * Network interfaces are 20gbit (2x10 in an 802.3ad encap3+4 bond)
    > * We only use the frontend network.
    > * All disks are spinning, some have block.db devices.
    > * All disks are bluestore
    > * configs are mostly defaults
    > * we've set the OSDs to restart=always without a limit, because we had
    > the problem of unavailable PGs when two OSDs that share PGs are marked
    > as offline.
    >
    > But since we installed the latest patch we have been experiencing more
    > OSD downs and even crashes.
    > I tried to remove as many duplicated lines as possible.
    >
    > Is the numa error a problem?
    > Why do OSD daemons not respond to heartbeats? I mean, even when the
    > disk is totally loaded with IO, the system itself should still answer
    > heartbeats, or am I missing something?
    >
    > I really hope some of you can point me in the right direction to solve
    > this nasty problem.
    >
    > This is what the latest crash looks like:
    > Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+0000
    > 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > ...
    > Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+0000
    > 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal
    (Aborted) **
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
    > thread_name:tp_osd_tp
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
    > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
    [0x7f5f0d4623c0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
    > [0x7f5f0d45ef08]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
    > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
    char const*,
    > unsigned long)+0x471) [0x55a699a01201]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
    > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
    unsigned
    > long, unsigned long)+0x8e) [0x55a699a0199e]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
    > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
    > [0x55a699a224b0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
    > (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
    [0x55a699a252c4]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
    [0x7f5f0d456609]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
    [0x7f5f0cfc0163]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+0000
    > 7f5ef1501700 -1 *** Caught signal (Aborted) **
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
    > thread_name:tp_osd_tp
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
    > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
    [0x7f5f0d4623c0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
    > [0x7f5f0d45ef08]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
    > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
    char const*,
    > unsigned long)+0x471) [0x55a699a01201]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
    > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
    unsigned
    > long, unsigned long)+0x8e) [0x55a699a0199e]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
    > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
    > [0x55a699a224b0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
    > (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
    [0x55a699a252c4]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
    [0x7f5f0d456609]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
    [0x7f5f0cfc0163]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the
    executable, or
    > `objdump -rdS <executable>` is needed to interpret this.
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246>
    2022-03-07T17:49:07.678+0000
    > 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:      0>
    2022-03-07T17:53:07.387+0000
    > 7f5ef1501700 -1 *** Caught signal (Aborted) **
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
    > thread_name:tp_osd_tp
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
    > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
    [0x7f5f0d4623c0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
    > [0x7f5f0d45ef08]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
    > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
    char const*,
    > unsigned long)+0x471) [0x55a699a01201]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
    > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
    unsigned
    > long, unsigned long)+0x8e) [0x55a699a0199e]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
    > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
    > [0x55a699a224b0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
    > (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
    [0x55a699a252c4]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
    [0x7f5f0d456609]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
    [0x7f5f0cfc0163]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the
    executable, or
    > `objdump -rdS <executable>` is needed to interpret this.
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246>
    2022-03-07T17:49:07.678+0000
    > 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:      0>
    2022-03-07T17:53:07.387+0000
    > 7f5ef1501700 -1 *** Caught signal (Aborted) **
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
    > thread_name:tp_osd_tp
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
    > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
    [0x7f5f0d4623c0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
    > [0x7f5f0d45ef08]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
    > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
    char const*,
    > unsigned long)+0x471) [0x55a699a01201]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
    > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
    unsigned
    > long, unsigned long)+0x8e) [0x55a699a0199e]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
    > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
    > [0x55a699a224b0]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
    > (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
    [0x55a699a252c4]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
    [0x7f5f0d456609]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
    [0x7f5f0cfc0163]
    > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the
    executable, or
    > `objdump -rdS <executable>` is needed to interpret this.
    > Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Main
    process
    > exited, code=killed, status=6/ABRT
    > Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Failed
    with result
    > 'signal'.
    > Mar 07 17:53:19 s3db18 systemd[1]: ceph-osd@161.service:
    Scheduled restart
    > job, restart counter is at 1.
    > Mar 07 17:53:19 s3db18 systemd[1]: Stopped Ceph object storage
    daemon
    > osd.161.
    > Mar 07 17:53:19 s3db18 systemd[1]: Starting Ceph object storage
    daemon
    > osd.161...
    > Mar 07 17:53:19 s3db18 systemd[1]: Started Ceph object storage
    daemon
    > osd.161.
    > Mar 07 17:53:20 s3db18 ceph-osd[4009440]:
    2022-03-07T17:53:20.498+0000
    > 7f9617781d80 -1 Falling back to public interface
    > Mar 07 17:53:33 s3db18 ceph-osd[4009440]:
    2022-03-07T17:53:33.906+0000
    > 7f9617781d80 -1 osd.161 489778 log_to_monitors {default=true}
    > Mar 07 17:53:34 s3db18 ceph-osd[4009440]:
    2022-03-07T17:53:34.206+0000
    > 7f96106f2700 -1 osd.161 489778 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > ...
    > Mar 07 18:58:12 s3db18 ceph-osd[4009440]:
    2022-03-07T18:58:12.717+0000
    > 7f96106f2700 -1 osd.161 489880 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    >
    > And this is what it looks like when OSDs get marked as out:
    > Mar 03 19:29:04 s3db13 ceph-osd[5792]: 2022-03-03T19:29:04.857+0000
    > 7f16115e0700 -1 osd.97 485814 heartbeat_check: no reply from
    > [XX:22::65]:6886 osd.124 since back
    2022-03-03T19:28:41.250692+0000 front
    > 2022-03-03T19:28:41.250649+0000 (oldest deadline
    > 2022-03-03T19:29:04.150352+0000)
    > ...130 time...
    > Mar 03 21:55:37 s3db13 ceph-osd[5792]: 2022-03-03T21:55:37.844+0000
    > 7f16115e0700 -1 osd.97 486383 heartbeat_check: no reply from
    > [XX:22::65]:6941 osd.124 since back
    2022-03-03T21:55:12.514627+0000 front
    > 2022-03-03T21:55:12.514649+0000 (oldest deadline
    > 2022-03-03T21:55:36.613469+0000)
    > Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.035+0000
    > 7f1613080700 -1 received  signal: Hangup from killall -q -1 ceph-mon
    > ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror (PID:
    1385079)
    > UID: 0
    > Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.047+0000
    > 7f1613080700 -1 received  signal: Hangup from  (PID: 1385080) UID: 0
    > Mar 04 00:06:00 s3db13 sudo[1389262]:     ceph : TTY=unknown ;
    PWD=/ ;
    > USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
    > Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session):
    session
    > opened for user root by (uid=0)
    > Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session):
    session
    > closed for user root
    > Mar 04 00:06:01 s3db13 sudo[1389287]:     ceph : TTY=unknown ;
    PWD=/ ;
    > USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json /dev/sde
    > Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session):
    session
    > opened for user root by (uid=0)
    > Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session):
    session
    > closed for user root
    > Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.213+0000
    > 7f1613080700 -1 received  signal: Hangup from killall -q -1 ceph-mon
    > ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror (PID:
    2406262)
    > UID: 0
    > Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.237+0000
    > 7f1613080700 -1 received  signal: Hangup from  (PID: 2406263) UID: 0
    > Mar 05 00:08:03 s3db13 sudo[2411721]:     ceph : TTY=unknown ;
    PWD=/ ;
    > USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
    > Mar 05 00:08:03 s3db13 sudo[2411721]: pam_unix(sudo:session):
    session
    > opened for user root by (uid=0)
    > Mar 05 00:08:04 s3db13 sudo[2411721]: pam_unix(sudo:session):
    session
    > closed for user root
    > Mar 05 00:08:04 s3db13 sudo[2411725]:     ceph : TTY=unknown ;
    PWD=/ ;
    > USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json /dev/sde
    > Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session):
    session
    > opened for user root by (uid=0)
    > Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session):
    session
    > closed for user root
    > Mar 05 19:19:49 s3db13 ceph-osd[5792]: 2022-03-05T19:19:49.189+0000
    > 7f160fddd700 -1 osd.97 486852 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 05 19:21:18 s3db13 ceph-osd[5792]: 2022-03-05T19:21:18.377+0000
    > 7f160fddd700 -1 osd.97 486858 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 05 19:21:45 s3db13 ceph-osd[5792]: 2022-03-05T19:21:45.304+0000
    > 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:21:21.762744+0000 front
    > 2022-03-05T19:21:21.762723+0000 (oldest deadline
    > 2022-03-05T19:21:45.261347+0000)
    > Mar 05 19:21:46 s3db13 ceph-osd[5792]: 2022-03-05T19:21:46.260+0000
    > 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:21:21.762744+0000 front
    > 2022-03-05T19:21:21.762723+0000 (oldest deadline
    > 2022-03-05T19:21:45.261347+0000)
    > Mar 05 19:21:47 s3db13 ceph-osd[5792]: 2022-03-05T19:21:47.252+0000
    > 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:21:21.762744+0000 front
    > 2022-03-05T19:21:21.762723+0000 (oldest deadline
    > 2022-03-05T19:21:45.261347+0000)
    > Mar 05 19:22:59 s3db13 ceph-osd[5792]: 2022-03-05T19:22:59.636+0000
    > 7f160fddd700 -1 osd.97 486869 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 05 19:23:33 s3db13 ceph-osd[5792]: 2022-03-05T19:23:33.439+0000
    > 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:23:34 s3db13 ceph-osd[5792]: 2022-03-05T19:23:34.458+0000
    > 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:23:35 s3db13 ceph-osd[5792]: 2022-03-05T19:23:35.434+0000
    > 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:23:09.928097+0000 front
    > 2022-03-05T19:23:09.928150+0000 (oldest deadline
    > 2022-03-05T19:23:35.227545+0000)
    > ...
    > Mar 05 19:23:48 s3db13 ceph-osd[5792]: 2022-03-05T19:23:48.386+0000
    > 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000
    > 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 since back
    2022-03-05T19:23:09.928097+0000 front
    > 2022-03-05T19:23:09.928150+0000 (oldest deadline
    > 2022-03-05T19:23:35.227545+0000)
    > Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000
    > 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:23:50 s3db13 ceph-osd[5792]: 2022-03-05T19:23:50.358+0000
    > 7f16115e0700 -1 osd.97 486873 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:23:51 s3db13 ceph-osd[5792]: 2022-03-05T19:23:51.330+0000
    > 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    > [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:23:52 s3db13 ceph-osd[5792]: 2022-03-05T19:23:52.326+0000
    > 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    > [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:23:53 s3db13 ceph-osd[5792]: 2022-03-05T19:23:53.338+0000
    > 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    > [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
    > ondisk+retry+read+known_if_redirected e486872)
    > Mar 05 19:25:02 s3db13 ceph-osd[5792]: 2022-03-05T19:25:02.342+0000
    > 7f160fddd700 -1 osd.97 486878 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 05 19:25:33 s3db13 ceph-osd[5792]: 2022-03-05T19:25:33.569+0000
    > 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 2
    slow ops,
    > oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486879)
    > ...
    > Mar 05 19:25:44 s3db13 ceph-osd[5792]: 2022-03-05T19:25:44.476+0000
    > 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    > oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486879)
    > Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000
    > 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    > 2022-03-05T19:25:25.281582+0000 (oldest deadline
    > 2022-03-05T19:25:45.281582+0000)
    > Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000
    > 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    > oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486879)
    > ...
    > Mar 05 19:26:08 s3db13 ceph-osd[5792]: 2022-03-05T19:26:08.363+0000
    > 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    > oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486879)
    > Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.371+0000
    > 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    > 2022-03-05T19:25:25.281582+0000 (oldest deadline
    > 2022-03-05T19:25:45.281582+0000)
    > Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.375+0000
    > 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3
    slow ops,
    > oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486879)
    > Mar 05 19:26:10 s3db13 ceph-osd[5792]: 2022-03-05T19:26:10.383+0000
    > 7f16115e0700 -1 osd.97 486881 get_health_metrics reporting 3
    slow ops,
    > oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486879)
    > Mar 05 19:26:11 s3db13 ceph-osd[5792]: 2022-03-05T19:26:11.407+0000
    > 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    > [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11
    > ondisk+retry+read+known_if_redirected e486879)
    > Mar 05 19:26:12 s3db13 ceph-osd[5792]: 2022-03-05T19:26:12.399+0000
    > 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1
    slow ops,
    > oldest is osd_op(client.2304224848.0:3139913 4.d
    4:b0b12ee9:::gc.22:head
    > [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11
    > ondisk+retry+read+known_if_redirected e486879)
    > Mar 05 19:27:24 s3db13 ceph-osd[5792]: 2022-03-05T19:27:24.975+0000
    > 7f160fddd700 -1 osd.97 486887 set_numa_affinity unable to
    identify public
    > interface '' numa node: (2) No such file or directory
    > Mar 05 19:27:58 s3db13 ceph-osd[5792]: 2022-03-05T19:27:58.114+0000
    > 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    > oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486889)
    > ...
    > Mar 05 19:28:08 s3db13 ceph-osd[5792]: 2022-03-05T19:28:08.137+0000
    > 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    > oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486889)
    > Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000
    > 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    > 2022-03-05T19:27:48.548094+0000 (oldest deadline
    > 2022-03-05T19:28:08.548094+0000)
    > Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000
    > 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    > oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486889)
    > ...
    > Mar 05 19:28:29 s3db13 ceph-osd[5792]: 2022-03-05T19:28:29.060+0000
    > 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4
    slow ops,
    > oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d
    (undecoded)
    > ondisk+retry+write+known_if_redirected e486889)
    > Mar 05 19:28:30 s3db13 ceph-osd[5792]: 2022-03-05T19:28:30.040+0000
    > 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from
    > [XX:22::60]:6834 osd.171 ever on either front or back, first
    ping sent
    > 2022-03-05T19:27:48.548094+0000 (oldest deadline
    > 2022-03-05T19:28:08.548094+0000)
    > Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.696+0000
    > 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
    > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
    > Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000
    > 7f1613080700 -1 received  signal: Interrupt from Kernel ( Could be
    > generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
    > Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000
    > 7f1613080700 -1 osd.97 486896 *** Got signal Interrupt ***
    > Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000
    > 7f1613080700 -1 osd.97 486896 *** Immediate shutdown
    > (osd_fast_shutdown=true) ***
    > Mar 05 19:29:44 s3db13 systemd[1]: ceph-osd@97.service: Succeeded.
    > Mar 05 19:29:54 s3db13 systemd[1]: ceph-osd@97.service:
    Scheduled restart
    > job, restart counter is at 1.
    > Mar 05 19:29:54 s3db13 systemd[1]: Stopped Ceph object storage
    daemon
    > osd.97.
    > Mar 05 19:29:54 s3db13 systemd[1]: Starting Ceph object storage
    daemon
    > osd.97...
    > Mar 05 19:29:54 s3db13 systemd[1]: Started Ceph object storage
    daemon
    > osd.97.
    > Mar 05 19:29:55 s3db13 ceph-osd[3236773]:
    2022-03-05T19:29:55.116+0000
    > 7f5852f74d80 -1 Falling back to public interface
    > Mar 05 19:30:34 s3db13 ceph-osd[3236773]:
    2022-03-05T19:30:34.746+0000
    > 7f5852f74d80 -1 osd.97 486896 log_to_monitors {default=true}

    --
    Igor Fedotov
    Ceph Lead Developer

    Looking for help with your Ceph cluster? Contact us at
    https://croit.io

    croit GmbH, Freseniusstr. 31h, 81247 Munich
    CEO: Martin Verges - VAT-ID: DE310638492
    Com. register: Amtsgericht Munich HRB 231263
    Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



--
This time, as an exception, the self-help group "UTF-8 problems" will meet in the big hall.

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



