Also: I never had problems with backfill / rebalance / recovery before, but since upgrading from Pacific to Quincy I'm seeing runaway CPU usage even with very conservative recovery settings:

osd_recovery_sleep_hdd = 0.1
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_delay_start = 600

Tried:

osd_mclock_profile = "high_recovery_ops"

It did not help. Regardless of the config, the CPU eventually runs away so badly that the OSD starts failing health checks, which causes even more problems, so I tried setting:

nodown,noout,noscrub,nodeep-scrub

But none of that helped the recovery make progress either. For now the only way back to a healthy cluster seems to be:

ceph osd set norebalance

With rebalance toggled off and the cluster slowly finishing the rebalances already in progress, I noticed that the whole cluster has almost no IO on the disks, except on one of the hosts, where a single disk sits at ~100% utilisation and the load bounces around from disk to disk.

Example of the host with the bouncing load:

root@ceph-server-04:~# !dstat
dstat -cd --disk-util --disk-tps --net
----total-usage---- -dsk/total- nvme-sdb--sda--sdc--sdd--sde--sdf--sdg--sdh--sdi--sdj--sdk- -dsk/total- -net/total-
usr sys idl wai stl| read writ|util:util:util:util:util:util:util:util:util:util:util:util|#read #writ| recv send
74 12 9 3 0|2542k 246M|7.49:99.3: 0: 0: 0:27.2: 0:99.3: 0: 0: 0: 0| 9 636 |1251k 829k
75 11 10 3 0| 29M 254M|7.65: 101: 0: 0:74.1:20.1: 0: 101: 0: 0: 0: 0| 205 686 |4246k 7841k
61 26 9 3 0|6340k 250M|2.81: 101: 0: 0:12.9: 0: 0:99.7: 0: 0: 0: 0| 45 660 | 35M 35M
69 20 8 2 0| 0 243M|5.20:98.5: 0: 0: 0: 0: 0:99.7: 0: 0: 0: 0| 0 649 | 650k 442k
71 20 8 0 0| 0 150M|5.13:87.9: 0: 0: 0: 0: 0:68.2: 0: 0: 0: 0| 0 360 | 703k 443k
72 16 11 57 0|8168B 51M|5.18: 0: 0: 0: 0: 0: 0:1.99: 0: 0:86.5: 0| 2 129 | 702k 524k
72 16 11 1 0| 0 5865k|7.28: 0: 0: 0: 0: 0: 0: 0: 0: 0:90.6: 0| 0 36 |1578k 1184k
71 16 12 0 0| 0 6519k|7.25: 0: 0: 0: 0: 0: 0: 0: 0: 0: 112: 0| 0 38 | 904k 553k
75 11 11 2 0| 522k 32M|1.96: 0: 0: 0:1.96: 0: 0: 0: 0: 0:98.5: 0| 2 81 |1022k 847k
72 14 12 1 0| 0 60M|5.72: 0: 0: 0: 0: 0: 0: 0: 0: 0: 102: 0| 0 160 | 826k 550k
65 19 13 2 0| 0 124M|5.57: 0: 0:99.1: 0: 0: 0: 0: 0: 0: 0: 0| 0 339 | 648k 340k
69 17 11 2 0| 0 125M|2.82: 0: 0: 101: 0: 0: 0: 0: 0: 0: 0: 0| 0 333 | 694k 482k
75 15 9 1 0| 0 123M|3.56: 0: 0:99.3: 0: 0: 0: 0: 0: 0: 0: 0| 0 331 |1760k 1368k
79 10 9 1 0| 0 114M|2.01: 0: 0: 101: 0: 0: 0: 0: 0: 0: 0: 0| 0 335 | 893k 636k
77 14 8 0 0| 685k 72M|4.41: 0: 0:82.9: 0: 0: 0: 0: 0:1.20: 0: 0| 1 195 |1590k 1482k

You can see that the "active" IO host is not doing much network traffic. The weird part is that the OSDs on the idle machines see huge CPU load even during periods of no IO. There are "some" possible explanations for that, since the cluster is entirely jerasure-coded HDDs with k=6, m=3, but it seems weird that such a small amount of data would be so CPU-intensive to recover when there is no performance degradation to client operations.

My best guess is some sort of spin lock (or equivalent) waiting on contended IO in the OSDs, due to changed behaviour in how queued recovery operations are handled?

Setting just:

osd_op_queue = "wpq"

fixes my cluster: recovery now runs at the same speed while using on average 3-6% CPU per OSD, down from 100-300%.
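In case it helps anyone else who hits this, the workaround boils down to roughly the following. This is only a sketch, not a recipe: it assumes the option is applied through the central config database (putting osd_op_queue = wpq in ceph.conf should work too), the restart step assumes systemd-managed OSDs (with cephadm it would be something like "ceph orch daemon restart osd.<id>" instead), and osd.3 is just a placeholder.

# Switch the op queue scheduler back to WPQ for all OSDs.
# osd_op_queue is only read at OSD start-up, so the daemons must be restarted.
ceph config set osd osd_op_queue wpq

# Restart the OSDs, e.g. one host at a time, with noout set to avoid extra data movement:
ceph osd set noout
systemctl restart ceph-osd.target
ceph osd unset noout

# On the OSD's host, verify via the admin socket that the running daemon picked it up:
ceph daemon osd.3 config get osd_op_queue

With wpq active, recovery runs at the same rate as before but per-OSD CPU drops back to the 3-6% mentioned above.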
On Tue, Jul 12, 2022 at 7:56 PM Sridhar Seshasayee <sseshasa@xxxxxxxxxx> wrote:
> Hi Chris,
>
> While we look into this, I have a couple of questions:
>
> 1. Did the recovery rate stay at 1 object/sec throughout? In our tests we
> have seen that the rate is higher during the starting phase of recovery
> and eventually tapers off due to throttling by mclock.
>
> 2. Can you try speeding up the recovery by changing to the "high_recovery_ops"
> profile on all the OSDs to see if it improves things (both CPU load and
> recovery rate)?
>
> 3. On the OSDs that showed high CPU usage, can you run the following
> command and revert back? This just dumps the mclock settings on the OSDs.
>
> sudo ceph daemon osd.N config show | grep osd_mclock
>
> I will update the tracker with these questions as well so that the
> discussion can continue there.
>
> Thanks,
> -Sridhar
>
> On Tue, Jul 12, 2022 at 4:49 PM Chris Palmer <chris.palmer@xxxxxxxxx> wrote:
>
> > I've created tracker https://tracker.ceph.com/issues/56530 for this,
> > including info on replicating it on another cluster.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx