Re: Octopus OSDs extremely slow during upgrade from mimic

Hey Frank,

Did you upgrade directly from mimic to octopus?

There is a change in how the OSDs account for OMAP data when upgrading to
Octopus. In upgrades to Octopus, especially where a large amount of OMAP data
is stored on the OSDs, we have seen it take quite a long time for the OSDs to
come back to normal function, on the order of a few hours. Typically we would
upgrade one node of OSDs at a time, wait for them to finish, and then move on
to the next one.

You can read more on the Ceph Docs here: 

https://docs.ceph.com/en/octopus/releases/octopus/#instructions

> Note that the first time each OSD starts, it will do a format conversion to
> improve the accounting for "omap" data. This may take a few minutes to as
> much as a few hours (for an HDD with lots of omap data).
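
If I remember correctly, the conversion is controlled by the
bluestore_fsck_quick_fix_on_mount option, so it can also be deferred and run
manually per OSD later, with the OSD stopped. Roughly along these lines
(please double-check against your version; the OSD id is just an example):

# ceph config set osd bluestore_fsck_quick_fix_on_mount false
# ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/ceph-0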

If you check the systemd status of one of these OSDs, or its logs, you should
see plenty of mentions of this conversion, I believe.
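
Something along these lines should show it (unit names and log paths can
differ depending on how the OSDs are deployed; osd.0 is just an example):

# systemctl status ceph-osd@0
# journalctl -u ceph-osd@0 --since "1 hour ago"
# grep -i omap /var/log/ceph/ceph-osd.0.log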

Regards,

Bailey

-----Original Message-----
From: Frank Schilder <frans@xxxxxx> 
Sent: September 5, 2022 1:39 PM
To: ceph-users@xxxxxxx
Subject: Re: Octopus OSDs extremely slow during upgrade from mimic

Another observation. After the cluster health went down to only 1 PG degraded,
we could see quite a long pause between the recovery of each of its degraded
objects. Recovery has finished now, but client IO is still close to 0.

After recovery finished I restarted one OSD to see if it would improve the
situation. It didn't. First of all, startup is unusually slow as well, and then
the OSD goes into a restart loop (well, a marked down-up loop). It gets marked
down by the MONs due to its long response time. Peering is extremely slow as
well, and I had to set nodown to get the OSD to stay in the cluster.
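
For reference, nothing special here, just the standard flag commands:

# ceph osd set nodown
# ceph osd unset nodown     (to be removed again once the OSD has settled)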

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 05 September 2022 18:08:05
To: ceph-users@xxxxxxx
Subject: Re: Octopus OSDs extremely slow during upgrade from mimic

Top shows that the osd_op_tp threads are consuming 100% CPU, and the OSD log
contains lots of these messages:

2022-09-05T18:06:13.332+0200 7f2778591700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
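
For completeness, the stuck ops can also be inspected via the admin socket on
the OSD host (osd.0 is just an example; I have not included that output here):

# ceph daemon osd.0 dump_ops_in_flight
# ceph daemon osd.0 dump_blocked_ops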

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 05 September 2022 17:53:57
To: ceph-users@xxxxxxx
Subject:  Octopus OSDs extremely slow during upgrade from mimic

Hi all,

we are performing an upgrade from mimic to octopus on a test cluster and
observe that octopus OSDs are slow to the point that IO is close to
impossible. The situation:

- We are running a test workload to simulate a realistic situation.
- We have tested the workload with both octopus and mimic, also under degraded
conditions, and everything worked well.
- Now we are in the middle of the upgrade and the cluster has to repair
missed writes from the time the OSDs of a host were upgraded to octopus.
- Since this upgrade, the performance of the octopus OSDs is extremely poor.

We had ca. 5000/46817475 degraded objects, a number that would be repaired
within a few seconds, or minutes at most, under normal conditions. Right now we
observe negligible recovery speed. What I see on the hosts is that the mimic
OSDs are mostly idle while the octopus OSDs are at 100% CPU, which points to
the octopus OSDs being the bottleneck. Network traffic and everything else
basically collapsed to 0 after upgrading the first 3 OSDs.
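
For reference, I am judging the CPU load from per-thread top on the OSD hosts,
something like this (purely illustrative):

# top -H -p "$(pidof ceph-osd | tr ' ' ',')"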

Does anyone have an idea what the bottleneck is and how it can be overcome?

Some diagnostic info:

# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            3 OSD(s) reporting legacy (not per-pool) BlueStore stats
            3 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            1 MDSs report slow requests
            3 monitors have not enabled msgr2
            noout flag(s) set
            Degraded data redundancy: 2616/46818177 objects degraded (0.006%), 158 pgs degraded, 42 pgs undersized
            5 slow ops, oldest one blocked for 119 sec, daemons [osd.0,osd.2,osd.3,osd.4,osd.6] have slow ops.

  services:
    mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 49m)
    mgr: tceph-01(active, since 44m), standbys: tceph-03, tceph-02
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up, 9 in; 42 remapped pgs
         flags noout

  data:
    pools:   4 pools, 321 pgs
    objects: 10.42M objects, 352 GiB
    usage:   1.7 TiB used, 769 GiB / 2.4 TiB avail
    pgs:     2616/46818177 objects degraded (0.006%)
             116 active+clean+snaptrim_wait
             90  active+recovery_wait+degraded
             41  active+recovery_wait+undersized+degraded+remapped
             26  active+clean
             26  active+recovering+degraded
             18  active+clean+snaptrim
             2   active+recovery_wait
             1   active+recovering
             1   active+recovering+undersized+degraded+remapped

  io:
    client:   18 KiB/s wr, 0 op/s rd, 1 op/s wr
    recovery: 0 B/s, 0 objects/s

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "osd": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 6,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mds": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 3
    },
    "overall": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 9,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 9
    }
}

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         2.44798  root default
-3         0.81599      host tceph-01
 0    hdd  0.27199          osd.0          up   0.84999  1.00000 <- octopus
 3    hdd  0.27199          osd.3          up   0.89999  1.00000 <- octopus
 6    hdd  0.27199          osd.6          up   0.95000  1.00000 <- octopus
-5         0.81599      host tceph-02
 2    hdd  0.27199          osd.2          up   1.00000  1.00000 <- mimic
 5    hdd  0.27199          osd.5          up   0.84999  1.00000 <- mimic
 7    hdd  0.27199          osd.7          up   0.95000  1.00000 <- mimic
-7         0.81599      host tceph-03
 1    hdd  0.27199          osd.1          up   0.95000  1.00000 <- mimic
 4    hdd  0.27199          osd.4          up   0.89999  1.00000 <- mimic
 8    hdd  0.27199          osd.8          up   1.00000  1.00000 <- mimic

# ceph config dump
WHO     MASK       LEVEL     OPTION                             VALUE                                                         RO
global             unknown   bluefs_preextend_wal_files         true                                                          *
global             advanced  osd_map_message_max_bytes          16384
global             advanced  osd_op_queue                       wpq                                                           *
global             advanced  osd_op_queue_cut_off               high                                                          *
  mon              advanced  mon_sync_max_payload_size          4096
  mgr              unknown   mgr/dashboard/password             $2b$12$DYJkkmdzaVtFR.GWYhTT.ezwGgNLi1BL7meoY.z8ya4PP9MfZIPqu  *
  mgr              unknown   mgr/dashboard/username             rit                                                           *
  osd              dev       bluestore_fsck_quick_fix_on_mount  false
  osd   class:hdd  advanced  osd_max_backfills                  18
  osd   class:hdd  dev       osd_memory_cache_min               805306368
  osd   class:hdd  basic     osd_memory_target                  1611661312
  osd   class:hdd  advanced  osd_recovery_max_active            8
  osd   class:hdd  advanced  osd_recovery_sleep                 0.050000
  osd   class:hdd  advanced  osd_snap_trim_sleep                0.100000
  mds              basic     client_cache_size                  8192
  mds              advanced  mds_bal_fragment_size_max          500000
  mds              basic     mds_cache_memory_limit             17179869184
  mds              advanced  mds_cache_reservation              0.500000
  mds              advanced  mds_max_caps_per_client            65536
  mds              advanced  mds_min_caps_per_client            4096
  mds              advanced  mds_recall_max_caps                16384
  mds              advanced  mds_session_blacklist_on_timeout   false

# ceph config get osd.0 bluefs_buffered_io
true

Thanks for any pointers,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



