Hi all,

we are performing an upgrade from mimic to octopus on a test cluster and observe that the octopus OSDs are slow to the point that IO is close to impossible. The situation:

- We are running a test workload to simulate a realistic situation.
- We have tested the workload with both octopus and mimic, also under degraded conditions, and everything worked well.
- We are now in the middle of the upgrade, and the cluster has to repair the writes that were missed while the OSDs of one host were upgraded to octopus.
- Since this upgrade, the performance of the octopus OSDs has been extremely poor.

We had ca. 5000/46817475 degraded objects, a number that would normally be repaired within a few seconds, or minutes at most. Right now we observe negligible recovery speed.

What I see on the hosts is that the mimic OSDs are mostly idle while the octopus OSDs are at 100% CPU, which points to the octopus OSDs being the bottleneck. Network traffic and everything else basically collapsed to 0 after upgrading the first 3 OSDs.

Does anyone have an idea what the bottleneck is and how it can be overcome?
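For what it's worth, this is roughly how I am looking at the stuck ops and the CPU consumption on the upgraded OSDs; osd.0 is just an example, and the daemon commands have to run on the host that carries that OSD:

# ceph daemon osd.0 dump_ops_in_flight   <- ops currently stuck in the OSD
# ceph daemon osd.0 dump_historic_ops    <- recent ops with per-stage timings
# ceph daemon osd.0 perf dump            <- internal counters, e.g. bluestore/rocksdb latencies
# top -H -p "$(pgrep -d, ceph-osd)"      <- which OSD threads are burning the CPU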
Some diagnostic info:

# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            3 OSD(s) reporting legacy (not per-pool) BlueStore stats
            3 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            1 MDSs report slow requests
            3 monitors have not enabled msgr2
            noout flag(s) set
            Degraded data redundancy: 2616/46818177 objects degraded (0.006%), 158 pgs degraded, 42 pgs undersized
            5 slow ops, oldest one blocked for 119 sec, daemons [osd.0,osd.2,osd.3,osd.4,osd.6] have slow ops.

  services:
    mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 49m)
    mgr: tceph-01(active, since 44m), standbys: tceph-03, tceph-02
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up, 9 in; 42 remapped pgs
         flags noout

  data:
    pools:   4 pools, 321 pgs
    objects: 10.42M objects, 352 GiB
    usage:   1.7 TiB used, 769 GiB / 2.4 TiB avail
    pgs:     2616/46818177 objects degraded (0.006%)
             116 active+clean+snaptrim_wait
             90  active+recovery_wait+degraded
             41  active+recovery_wait+undersized+degraded+remapped
             26  active+clean
             26  active+recovering+degraded
             18  active+clean+snaptrim
             2   active+recovery_wait
             1   active+recovering
             1   active+recovering+undersized+degraded+remapped

  io:
    client:   18 KiB/s wr, 0 op/s rd, 1 op/s wr
    recovery: 0 B/s, 0 objects/s

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "osd": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 6,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mds": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 3
    },
    "overall": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 9,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 9
    }
}

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         2.44798  root default
-3         0.81599      host tceph-01
 0    hdd  0.27199          osd.0          up   0.84999  1.00000  <- octopus
 3    hdd  0.27199          osd.3          up   0.89999  1.00000  <- octopus
 6    hdd  0.27199          osd.6          up   0.95000  1.00000  <- octopus
-5         0.81599      host tceph-02
 2    hdd  0.27199          osd.2          up   1.00000  1.00000  <- mimic
 5    hdd  0.27199          osd.5          up   0.84999  1.00000  <- mimic
 7    hdd  0.27199          osd.7          up   0.95000  1.00000  <- mimic
-7         0.81599      host tceph-03
 1    hdd  0.27199          osd.1          up   0.95000  1.00000  <- mimic
 4    hdd  0.27199          osd.4          up   0.89999  1.00000  <- mimic
 8    hdd  0.27199          osd.8          up   1.00000  1.00000  <- mimic

# ceph config dump
WHO  MASK       LEVEL     OPTION                             VALUE        RO
global          unknown   bluefs_preextend_wal_files         true         *
global          advanced  osd_map_message_max_bytes          16384
global          advanced  osd_op_queue                       wpq          *
global          advanced  osd_op_queue_cut_off               high         *
mon             advanced  mon_sync_max_payload_size          4096
mgr             unknown   mgr/dashboard/password             $2b$12$DYJkkmdzaVtFR.GWYhTT.ezwGgNLi1BL7meoY.z8ya4PP9MfZIPqu  *
mgr             unknown   mgr/dashboard/username             rit          *
osd             dev       bluestore_fsck_quick_fix_on_mount  false
osd  class:hdd  advanced  osd_max_backfills                  18
osd  class:hdd  dev       osd_memory_cache_min               805306368
osd  class:hdd  basic     osd_memory_target                  1611661312
osd  class:hdd  advanced  osd_recovery_max_active            8
osd  class:hdd  advanced  osd_recovery_sleep                 0.050000
osd  class:hdd  advanced  osd_snap_trim_sleep                0.100000
mds             basic     client_cache_size                  8192
mds             advanced  mds_bal_fragment_size_max          500000
mds             basic     mds_cache_memory_limit             17179869184
mds             advanced  mds_cache_reservation              0.500000
mds             advanced  mds_max_caps_per_client            65536
mds             advanced  mds_min_caps_per_client            4096
mds             advanced  mds_recall_max_caps                16384
mds             advanced  mds_session_blacklist_on_timeout   false

# ceph config get osd.0 bluefs_buffered_io
true

Thanks for any pointers,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
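P.S.: The bluefs_buffered_io value above comes from the mon config database; for completeness, this is one way to cross-check what a running octopus OSD actually uses, via its admin socket (again osd.0 as the example, executed on its host):

# ceph daemon osd.0 config get bluefs_buffered_io                      <- value the running daemon uses
# ceph daemon osd.0 config show | grep -e recovery_sleep -e snap_trim  <- runtime values of the throttles set above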