Another observation. After the cluster health went down to only 1 PG degraded, we could see that there was quite a long pause between the recovery of each of its degraded objects. Recovery is finished now, but client IO is still close to 0. After recovery finished I restarted 1 OSD to see if it would improve the situation. It didn't. First of all, startup is unusually slow as well, and then the OSD goes into a restart loop (well, a marked down-up loop): it gets marked down by the MONs due to its long response time. Peering is extremely slow as well, and I had to set nodown to get the OSD to stay in the cluster.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 05 September 2022 18:08:05
To: ceph-users@xxxxxxx
Subject: Re: Octopus OSDs extremely slow during upgrade from mimic

Top shows that the osd_op_tp thread is consuming 100% CPU and the OSD log contains lots of these messages:

2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 05 September 2022 17:53:57
To: ceph-users@xxxxxxx
Subject: Octopus OSDs extremely slow during upgrade from mimic

Hi all,

we are performing an upgrade from mimic to octopus on a test cluster and observe that the octopus OSDs are slow to the point that IO is close to impossible. The situation:

- We are running a test workload to simulate a realistic situation.
- We have tested the workload with both octopus and mimic, also under degraded conditions, and everything worked well.
- Now we are in the middle of the upgrade and the cluster has to repair missed writes from the time the OSDs of one host were upgraded to octopus.
- Since this upgrade step, the performance of the octopus OSDs is extremely poor.

We had ca. 5000/46817475 degraded objects. This is a number that would be repaired within a few seconds, or minutes at most, under normal conditions. Right now we observe negligible recovery speed. What I see on the hosts is that the mimic OSDs are mostly idle and the octopus OSDs are at 100% CPU. It seems to point to the octopus OSDs being the bottleneck.
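For completeness: the 100% CPU figure comes from watching the OSD processes with top in per-thread mode, and the busy op threads can be cross-checked against what the OSD itself thinks it is working on. Roughly along these lines (osd.0 and the PID are just placeholders, not the exact commands I ran; the ceph daemon calls have to be run on the host the OSD lives on):

# top -H -p <PID of the ceph-osd process for osd.0>
# ceph daemon osd.0 dump_ops_in_flight
# ceph daemon osd.0 dump_historic_ops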
Network traffic and everything else basically collapsed to 0 after upgrading the first 3 OSDs. Does anyone have an idea what the bottleneck is and how it can be overcome?

Some diagnostic info:

# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            3 OSD(s) reporting legacy (not per-pool) BlueStore stats
            3 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            1 MDSs report slow requests
            3 monitors have not enabled msgr2
            noout flag(s) set
            Degraded data redundancy: 2616/46818177 objects degraded (0.006%), 158 pgs degraded, 42 pgs undersized
            5 slow ops, oldest one blocked for 119 sec, daemons [osd.0,osd.2,osd.3,osd.4,osd.6] have slow ops.

  services:
    mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 49m)
    mgr: tceph-01(active, since 44m), standbys: tceph-03, tceph-02
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up, 9 in; 42 remapped pgs
         flags noout

  data:
    pools:   4 pools, 321 pgs
    objects: 10.42M objects, 352 GiB
    usage:   1.7 TiB used, 769 GiB / 2.4 TiB avail
    pgs:     2616/46818177 objects degraded (0.006%)
             116 active+clean+snaptrim_wait
             90  active+recovery_wait+degraded
             41  active+recovery_wait+undersized+degraded+remapped
             26  active+clean
             26  active+recovering+degraded
             18  active+clean+snaptrim
             2   active+recovery_wait
             1   active+recovering
             1   active+recovering+undersized+degraded+remapped

  io:
    client:   18 KiB/s wr, 0 op/s rd, 1 op/s wr
    recovery: 0 B/s, 0 objects/s

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "osd": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 6,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mds": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 3
    },
    "overall": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 9,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 9
    }
}

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       2.44798 root default
-3       0.81599     host tceph-01
 0   hdd 0.27199         osd.0          up  0.84999 1.00000   <- octopus
 3   hdd 0.27199         osd.3          up  0.89999 1.00000   <- octopus
 6   hdd 0.27199         osd.6          up  0.95000 1.00000   <- octopus
-5       0.81599     host tceph-02
 2   hdd 0.27199         osd.2          up  1.00000 1.00000   <- mimic
 5   hdd 0.27199         osd.5          up  0.84999 1.00000   <- mimic
 7   hdd 0.27199         osd.7          up  0.95000 1.00000   <- mimic
-7       0.81599     host tceph-03
 1   hdd 0.27199         osd.1          up  0.95000 1.00000   <- mimic
 4   hdd 0.27199         osd.4          up  0.89999 1.00000   <- mimic
 8   hdd 0.27199         osd.8          up  1.00000 1.00000   <- mimic

# ceph config dump
WHO     MASK       LEVEL     OPTION                             VALUE                                                          RO
global             unknown   bluefs_preextend_wal_files         true                                                           *
global             advanced  osd_map_message_max_bytes          16384
global             advanced  osd_op_queue                       wpq                                                            *
global             advanced  osd_op_queue_cut_off               high                                                           *
mon                advanced  mon_sync_max_payload_size          4096
mgr                unknown   mgr/dashboard/password             $2b$12$DYJkkmdzaVtFR.GWYhTT.ezwGgNLi1BL7meoY.z8ya4PP9MfZIPqu   *
mgr                unknown   mgr/dashboard/username             rit                                                            *
osd                dev       bluestore_fsck_quick_fix_on_mount  false
osd     class:hdd  advanced  osd_max_backfills                  18
osd     class:hdd  dev       osd_memory_cache_min               805306368
osd     class:hdd  basic     osd_memory_target                  1611661312
osd     class:hdd  advanced  osd_recovery_max_active            8
osd     class:hdd  advanced  osd_recovery_sleep                 0.050000
osd     class:hdd  advanced  osd_snap_trim_sleep                0.100000
mds                basic     client_cache_size                  8192
mds                advanced  mds_bal_fragment_size_max          500000
mds                basic     mds_cache_memory_limit             17179869184
mds                advanced  mds_cache_reservation              0.500000
mds                advanced  mds_max_caps_per_client            65536
mds                advanced  mds_min_caps_per_client            4096
mds                advanced  mds_recall_max_caps                16384
mds                advanced  mds_session_blacklist_on_timeout   false
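For context, the class:hdd recovery settings in the dump above were applied with the device-class mask syntax, and the value a running OSD actually uses can be cross-checked per daemon. Roughly like this (osd.0 is only an example, and the value is the one already shown in the dump):

# ceph config set osd/class:hdd osd_recovery_sleep 0.05
# ceph config show osd.0 osd_recovery_sleep
# ceph config show osd.0 osd_max_backfills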
# ceph config get osd.0 bluefs_buffered_io
true

Thanks for any pointers,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx