Another observation. After the cluster health went down to only 1 PG degraded, we could see that there was quite a long pause between the recovery of each of its degraded objects. Recovery is finished now, but client IO is still close to 0. After recovery finished I restarted 1 OSD to see if it would improve the situation. It didn't. First of all, startup is unusually slow as well, and then the OSD goes into a restart loop (well, a marked down-up loop): it gets marked down by the MONs due to its long response time. Peering is extremely slow as well, and I had to set nodown to get the OSD to stay in the cluster.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 05 September 2022 18:08:05
To: ceph-users@xxxxxxx
Subject: Re: Octopus OSDs extremely slow during upgrade from mimic

Top shows that the osd_op_tp thread is consuming 100% CPU and the OSD log contains lots of these messages:

2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275bce8700' had timed out after 15
2022-09-05T18:06:13.332+0200 7f2777d90700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275b4e7700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2759ce4700' had timed out after 15
2022-09-05T18:06:13.368+0200 7f2778591700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f275ace6700' had timed out after 15

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 05 September 2022 17:53:57
To: ceph-users@xxxxxxx
Subject: Octopus OSDs extremely slow during upgrade from mimic

Hi all,

we are performing an upgrade from mimic to octopus on a test cluster and observe that the octopus OSDs are slow to the point that IO is close to impossible. The situation:

- We are running a test workload to simulate a realistic situation.
- We have tested the workload with both octopus and mimic, also under degraded conditions, and everything worked well.
- Now we are in the middle of the upgrade and the cluster has to repair missed writes from the time the OSDs of one host were upgraded to octopus.
- Since this upgrade step, the performance of the octopus OSDs is extremely poor.

We had ca. 5000/46817475 degraded objects. This is a number that would be repaired within a few seconds, or minutes at most, under normal conditions. Right now we observe negligible recovery speed. What I see on the hosts is that the mimic OSDs are mostly idle and the octopus OSDs are at 100% CPU. It seems to point to the octopus OSDs being the bottleneck.
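For completeness: the 100% CPU figure comes from watching the OSD processes with top in per-thread mode, and the busy op threads can be cross-checked against what the OSD itself thinks it is working on. Roughly along these lines (osd.0 and the PID are just placeholders, not the exact commands I ran; the ceph daemon calls have to be run on the host the OSD lives on):

# top -H -p <PID of the ceph-osd process for osd.0>
# ceph daemon osd.0 dump_ops_in_flight
# ceph daemon osd.0 dump_historic_ops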
Network traffic and everything else basically collapsed to 0 after upgrading the first 3 OSDs. Does anyone have an idea what the bottleneck is and how it can be overcome?

Some diagnostic info:

# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            3 OSD(s) reporting legacy (not per-pool) BlueStore stats
            3 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            1 MDSs report slow requests
            3 monitors have not enabled msgr2
            noout flag(s) set
            Degraded data redundancy: 2616/46818177 objects degraded (0.006%), 158 pgs degraded, 42 pgs undersized
            5 slow ops, oldest one blocked for 119 sec, daemons [osd.0,osd.2,osd.3,osd.4,osd.6] have slow ops.

  services:
    mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 49m)
    mgr: tceph-01(active, since 44m), standbys: tceph-03, tceph-02
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up, 9 in; 42 remapped pgs
         flags noout

  data:
    pools:   4 pools, 321 pgs
    objects: 10.42M objects, 352 GiB
    usage:   1.7 TiB used, 769 GiB / 2.4 TiB avail
    pgs:     2616/46818177 objects degraded (0.006%)
             116 active+clean+snaptrim_wait
             90  active+recovery_wait+degraded
             41  active+recovery_wait+undersized+degraded+remapped
             26  active+clean
             26  active+recovering+degraded
             18  active+clean+snaptrim
             2   active+recovery_wait
             1   active+recovering
             1   active+recovering+undersized+degraded+remapped

  io:
    client:   18 KiB/s wr, 0 op/s rd, 1 op/s wr
    recovery: 0 B/s, 0 objects/s

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "osd": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 6,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 3
    },
    "mds": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 3
    },
    "overall": {
        "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)": 9,
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 9
    }
}

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       2.44798 root default
-3       0.81599     host tceph-01
 0   hdd 0.27199         osd.0          up  0.84999 1.00000   <- octopus
 3   hdd 0.27199         osd.3          up  0.89999 1.00000   <- octopus
 6   hdd 0.27199         osd.6          up  0.95000 1.00000   <- octopus
-5       0.81599     host tceph-02
 2   hdd 0.27199         osd.2          up  1.00000 1.00000   <- mimic
 5   hdd 0.27199         osd.5          up  0.84999 1.00000   <- mimic
 7   hdd 0.27199         osd.7          up  0.95000 1.00000   <- mimic
-7       0.81599     host tceph-03
 1   hdd 0.27199         osd.1          up  0.95000 1.00000   <- mimic
 4   hdd 0.27199         osd.4          up  0.89999 1.00000   <- mimic
 8   hdd 0.27199         osd.8          up  1.00000 1.00000   <- mimic

# ceph config dump
WHO     MASK       LEVEL     OPTION                             VALUE                                                          RO
global             unknown   bluefs_preextend_wal_files         true                                                           *
global             advanced  osd_map_message_max_bytes          16384
global             advanced  osd_op_queue                       wpq                                                            *
global             advanced  osd_op_queue_cut_off               high                                                           *
mon                advanced  mon_sync_max_payload_size          4096
mgr                unknown   mgr/dashboard/password             $2b$12$DYJkkmdzaVtFR.GWYhTT.ezwGgNLi1BL7meoY.z8ya4PP9MfZIPqu   *
mgr                unknown   mgr/dashboard/username             rit                                                            *
osd                dev       bluestore_fsck_quick_fix_on_mount  false
osd     class:hdd  advanced  osd_max_backfills                  18
osd     class:hdd  dev       osd_memory_cache_min               805306368
osd     class:hdd  basic     osd_memory_target                  1611661312
osd     class:hdd  advanced  osd_recovery_max_active            8
osd     class:hdd  advanced  osd_recovery_sleep                 0.050000
osd     class:hdd  advanced  osd_snap_trim_sleep                0.100000
mds                basic     client_cache_size                  8192
mds                advanced  mds_bal_fragment_size_max          500000
mds                basic     mds_cache_memory_limit             17179869184
mds                advanced  mds_cache_reservation              0.500000
mds                advanced  mds_max_caps_per_client            65536
mds                advanced  mds_min_caps_per_client            4096
mds                advanced  mds_recall_max_caps                16384
mds                advanced  mds_session_blacklist_on_timeout   false
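For context, the class:hdd recovery settings in the dump above were applied with the device-class mask syntax, and the value a running OSD actually uses can be cross-checked per daemon. Roughly like this (osd.0 is only an example, and the value is the one already shown in the dump):

# ceph config set osd/class:hdd osd_recovery_sleep 0.05
# ceph config show osd.0 osd_recovery_sleep
# ceph config show osd.0 osd_max_backfills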
# ceph config get osd.0 bluefs_buffered_io
true

Thanks for any pointers,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx