Hey Frank,

Did you upgrade directly from mimic to octopus? When upgrading to Octopus there is a change to how the OSDs account for OMAP data. We have seen, when doing upgrades to Octopus, especially where there is a large amount of OMAP data stored on the OSDs, that it can take quite a long time for the OSDs to come back to normal function, typically a few hours or so. Typically we would upgrade one node of OSDs at a time, wait for them to finish, and then move on to the next one.

You can read more in the Ceph docs here: https://docs.ceph.com/en/octopus/releases/octopus/#instructions

> Note that the first time each OSD starts, it will do a format conversion to improve the accounting for "omap" data. This may take a few minutes to as much as a few hours (for an HDD with lots of omap data).

If you were to check the systemd status of one of these OSDs, or its logs, you should see lots of mentions of this conversion, I believe.

Regards,
Bailey

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: September 5, 2022 1:39 PM
To: ceph-users@xxxxxxx
Subject: Re: Octopus OSDs extremely slow during upgrade from mimic

Another observation. After the cluster health went down to only 1 PG degraded, we could see quite a long pause between the recovery of each of its degraded objects. Recovery is finished now, but client IO is still close to 0.

After recovery finished I restarted 1 OSD to see if it would improve the situation. It didn't. First of all, startup is unusually slow as well, and then the OSD goes into a restart loop (well, a marked down/up loop). It gets marked down by the MONs due to its long response time. Peering is extremely slow as well, and I had to set nodown to get the OSD to stay in the cluster.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 05 September 2022 18:08:05
To: ceph-users@xxxxxxx
Subject: Re: Octopus OSDs extremely slow during upgrade from mimic

Top shows that the osd_op_tp thread is consuming 100% CPU and the OSD log contains lots of these messages:

2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
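
For reference, one way to see what those op_tp threads are actually doing is to dump the in-flight and blocked ops from the OSD's admin socket on the host it runs on. A minimal sketch, assuming the default admin socket setup and using osd.3 (one of the upgraded OSDs) purely as an example:

# ceph daemon osd.3 dump_ops_in_flight
# ceph daemon osd.3 dump_blocked_ops
# ceph daemon osd.3 dump_historic_ops

dump_historic_ops in particular lists recently completed ops with a per-event timeline, which should give a hint at which stage the time is being spent.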

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 05 September 2022 17:53:57
To: ceph-users@xxxxxxx
Subject: Octopus OSDs extremely slow during upgrade from mimic

Hi all,

we are performing an upgrade from mimic to octopus on a test cluster and observe that the octopus OSDs are slow to the point that IO is close to impossible. The situation:

- We are running a test workload to simulate a realistic situation.
- We have tested the workload with both octopus and mimic, also under degraded conditions, and everything worked well.
- Now we are in the middle of the upgrade and the cluster has to repair missed writes from the time the OSDs of a host were upgraded to octopus.
- Since this upgrade, the performance of the octopus OSDs is extremely poor.

We had ca. 5000/46817475 degraded objects. This is a number that would be repaired within a few seconds, or minutes at most, under normal conditions. Right now we observe negligible recovery speed. What I see on the hosts is that the mimic OSDs are mostly idle and the octopus OSDs are at 100% CPU. It seems to point to the octopus OSDs being the bottleneck. Network traffic and everything else basically collapsed to 0 after upgrading the first 3 OSDs.

Does anyone have an idea what the bottleneck is and how it can be overcome? Some diagnostic info:

# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            3 OSD(s) reporting legacy (not per-pool) BlueStore stats
            3 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            1 MDSs report slow requests
            3 monitors have not enabled msgr2
            noout flag(s) set
            Degraded data redundancy: 2616/46818177 objects degraded (0.006%), 158 pgs degraded, 42 pgs undersized
            5 slow ops, oldest one blocked for 119 sec, daemons [osd.0,osd.2,osd.3,osd.4,osd.6] have slow ops.

  services:
    mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 49m)
    mgr: tceph-01(active, since 44m), standbys: tceph-03, tceph-02
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up, 9 in; 42 remapped pgs
         flags noout

  data:
    pools:   4 pools, 321 pgs
    objects: 10.42M objects, 352 GiB
    usage:   1.7 TiB used, 769 GiB / 2.4 TiB avail
    pgs:     2616/46818177 objects degraded (0.006%)
             116 active+clean+snaptrim_wait
             90  active+recovery_wait+degraded
             41  active+recovery_wait+undersized+degraded+remapped
             26  active+clean
             26  active+recovering+degraded
             18  active+clean+snaptrim
             2   active+recovery_wait
             1   active+recovering
             1   active+recovering+undersized+degraded+remapped

  io:
    client:   18 KiB/s wr, 0 op/s rd, 1 op/s wr
    recovery: 0 B/s, 0 objects/s

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "osd": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 6,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mds": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 3
    },
    "overall": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 9,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 9
    }
}

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         2.44798  root default
-3         0.81599      host tceph-01
 0    hdd  0.27199          osd.0          up   0.84999  1.00000   <- octopus
 3    hdd  0.27199          osd.3          up   0.89999  1.00000   <- octopus
 6    hdd  0.27199          osd.6          up   0.95000  1.00000   <- octopus
-5         0.81599      host tceph-02
 2    hdd  0.27199          osd.2          up   1.00000  1.00000   <- mimic
 5    hdd  0.27199          osd.5          up   0.84999  1.00000   <- mimic
 7    hdd  0.27199          osd.7          up   0.95000  1.00000   <- mimic
-7         0.81599      host tceph-03
 1    hdd  0.27199          osd.1          up   0.95000  1.00000   <- mimic
 4    hdd  0.27199          osd.4          up   0.89999  1.00000   <- mimic
 8    hdd  0.27199          osd.8          up   1.00000  1.00000   <- mimic

# ceph config dump
WHO     MASK       LEVEL     OPTION                             VALUE                                                          RO
global             unknown   bluefs_preextend_wal_files         true                                                           *
global             advanced  osd_map_message_max_bytes          16384
global             advanced  osd_op_queue                       wpq                                                            *
global             advanced  osd_op_queue_cut_off               high                                                           *
mon                advanced  mon_sync_max_payload_size          4096
mgr                unknown   mgr/dashboard/password             $2b$12$DYJkkmdzaVtFR.GWYhTT.ezwGgNLi1BL7meoY.z8ya4PP9MfZIPqu   *
mgr                unknown   mgr/dashboard/username             rit                                                            *
osd                dev       bluestore_fsck_quick_fix_on_mount  false
osd     class:hdd  advanced  osd_max_backfills                  18
osd     class:hdd  dev       osd_memory_cache_min               805306368
osd     class:hdd  basic     osd_memory_target                  1611661312
osd     class:hdd  advanced  osd_recovery_max_active            8
osd     class:hdd  advanced  osd_recovery_sleep                 0.050000
osd     class:hdd  advanced  osd_snap_trim_sleep                0.100000
mds                basic     client_cache_size                  8192
mds                advanced  mds_bal_fragment_size_max          500000
mds                basic     mds_cache_memory_limit             17179869184
mds                advanced  mds_cache_reservation              0.500000
mds                advanced  mds_max_caps_per_client            65536
mds                advanced  mds_min_caps_per_client            4096
mds                advanced  mds_recall_max_caps                16384
mds                advanced  mds_session_blacklist_on_timeout   false

# ceph config get osd.0 bluefs_buffered_io
true

Thanks for any pointers,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
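
As a follow-up to the omap format conversion mentioned at the top of this thread: a minimal sketch of how one might check whether an upgraded OSD was, or still is, busy with that conversion. It assumes systemd-managed OSD units (ceph-osd@<id>) and the default log location, with osd.0 (one of the upgraded OSDs) as the example; the grep keywords are guesses rather than exact log strings:

# systemctl status ceph-osd@0
# journalctl -u ceph-osd@0 | grep -iE 'omap|fsck|quick.?fix'
# grep -iE 'omap|fsck|quick.?fix' /var/log/ceph/ceph-osd.0.log

The bluestore_fsck_quick_fix_on_mount option that appears in the config dump above (set to false here) controls, as far as I understand, whether this kind of omap conversion is applied at OSD mount time; it can be read per OSD with "ceph config get osd.0 bluestore_fsck_quick_fix_on_mount".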

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx