Just in case people don't know: osd_op_queue = "wpq" requires an OSD restart (a rough sketch of the switch-over commands is at the end of this message). And further to my theory about the spin lock or similar: increasing my recovery rate 4-16x using wpq sees my CPU rise to 10-15% (from 3%)... but using mclock, even at very, very conservative recovery settings, sees a median CPU usage of some multiple of 100% (i.e. a multiple of a full machine core/thread per OSD).

On Tue, Jul 19, 2022 at 4:18 PM Daniel Williams <danielwoz@xxxxxxxxx> wrote:

> Also, I never had problems with backfill / rebalance / recovery, but after upgrading from Pacific to Quincy I now see runaway CPU usage even with very conservative recovery settings:
>
> osd_recovery_sleep_hdd = 0.1
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_delay_start = 600
>
> Tried:
> osd_mclock_profile = "high_recovery_ops"
> It did not help.
>
> The CPU eventually runs away so much (regardless of config) that the OSD starts failing health checks, which causes even more problems, so I tried
> nodown,noout,noscrub,nodeep-scrub
> but none of that helped progress the recovery forward either.
>
> The only way back to a healthy cluster for now seems to be
> ceph osd set norebalance
>
> With rebalance toggled off, while the cluster slowly finishes the rebalances already in progress, I noticed that the whole cluster has almost no IO on the disks, except that on one of the hosts a single disk at 100% utilisation bounces around from disk to disk.
>
> Example from the host with the bouncing load:
> root@ceph-server-04:~# !dstat
> dstat -cd --disk-util --disk-tps --net
> ----total-usage---- -dsk/total- nvme-sdb--sda--sdc--sdd--sde--sdf--sdg--sdh--sdi--sdj--sdk- -dsk/total- -net/total-
> usr sys idl wai stl| read  writ|util:util:util:util:util:util:util:util:util:util:util:util|#read #writ| recv  send
>  74  12   9   3   0|2542k  246M|7.49:99.3:   0:   0:   0:27.2:   0:99.3:   0:   0:   0:   0|   9   636 |1251k  829k
>  75  11  10   3   0|  29M  254M|7.65: 101:   0:   0:74.1:20.1:   0: 101:   0:   0:   0:   0| 205   686 |4246k 7841k
>  61  26   9   3   0|6340k  250M|2.81: 101:   0:   0:12.9:   0:   0:99.7:   0:   0:   0:   0|  45   660 |  35M   35M
>  69  20   8   2   0|   0   243M|5.20:98.5:   0:   0:   0:   0:   0:99.7:   0:   0:   0:   0|   0   649 | 650k  442k
>  71  20   8   0   0|   0   150M|5.13:87.9:   0:   0:   0:   0:   0:68.2:   0:   0:   0:   0|   0   360 | 703k  443k
>  72  16  11  57   0|8168B   51M|5.18:   0:   0:   0:   0:   0:   0:1.99:   0:   0:86.5:   0|   2   129 | 702k  524k
>  72  16  11   1   0|   0  5865k|7.28:   0:   0:   0:   0:   0:   0:   0:   0:   0:90.6:   0|   0    36 |1578k 1184k
>  71  16  12   0   0|   0  6519k|7.25:   0:   0:   0:   0:   0:   0:   0:   0:   0: 112:   0|   0    38 | 904k  553k
>  75  11  11   2   0| 522k   32M|1.96:   0:   0:   0:1.96:   0:   0:   0:   0:   0:98.5:   0|   2    81 |1022k  847k
>  72  14  12   1   0|   0    60M|5.72:   0:   0:   0:   0:   0:   0:   0:   0:   0: 102:   0|   0   160 | 826k  550k
>  65  19  13   2   0|   0   124M|5.57:   0:   0:99.1:   0:   0:   0:   0:   0:   0:   0:   0|   0   339 | 648k  340k
>  69  17  11   2   0|   0   125M|2.82:   0:   0: 101:   0:   0:   0:   0:   0:   0:   0:   0|   0   333 | 694k  482k
>  75  15   9   1   0|   0   123M|3.56:   0:   0:99.3:   0:   0:   0:   0:   0:   0:   0:   0|   0   331 |1760k 1368k
>  79  10   9   1   0|   0   114M|2.01:   0:   0: 101:   0:   0:   0:   0:   0:   0:   0:   0|   0   335 | 893k  636k
>  77  14   8   0   0| 685k   72M|4.41:   0:   0:82.9:   0:   0:   0:   0:   0:1.20:   0:   0|   1   195 |1590k 1482k
>
> You can see that the "active" IO host is not doing much network traffic.
>
> The weird part is that the OSDs on the idle machines see huge CPU load even during periods of no IO. There are "some" explanations for that, since the cluster is entirely jerasure-coded HDDs with k=6, m=3, but it seems weird that such a small amount of data would be so CPU-intensive to recover when there is no performance degradation to client operations.
>
> My best guess is some sort of weird spin lock or equivalent, waiting on contended IO on the OSDs, due to a changed behaviour in responses for queued recovery operations?
>
> Setting just:
> osd_op_queue = "wpq"
> fixes my cluster; recovery going at the same speed now uses on average 3-6% CPU per OSD, down from 100-300%.
>
> On Tue, Jul 12, 2022 at 7:56 PM Sridhar Seshasayee <sseshasa@xxxxxxxxxx> wrote:
>
>> Hi Chris,
>>
>> While we look into this, I have a couple of questions:
>>
>> 1. Did the recovery rate stay at 1 object/sec throughout? In our tests we have seen that the rate is higher during the starting phase of recovery and eventually tapers off due to throttling by mclock.
>>
>> 2. Can you try speeding up the recovery by changing to the "high_recovery_ops" profile on all the OSDs, to see if it improves things (both CPU load and recovery rate)?
>>
>> 3. On the OSDs that showed high CPU usage, can you run the following command and report back? This just dumps the mclock settings on the OSDs.
>>
>>     sudo ceph daemon osd.N config show | grep osd_mclock
>>
>> I will update the tracker with these questions as well so that the discussion can continue there.
>>
>> Thanks,
>> -Sridhar
>>
>> On Tue, Jul 12, 2022 at 4:49 PM Chris Palmer <chris.palmer@xxxxxxxxx> wrote:
>>
>> > I've created tracker https://tracker.ceph.com/issues/56530 for this, including info on replicating it on another cluster.
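
For anyone who wants to try the same workaround, the switch back to wpq boils down to something like the commands below. This is only a sketch: it assumes the setting lives in the cluster's central config database and that the OSDs run under systemd; the restart step will differ for cephadm/containerised deployments, and osd.N / <id> are placeholders.

    # stage the change cluster-wide (only takes effect when each OSD restarts)
    ceph config set osd osd_op_queue wpq
    ceph config get osd osd_op_queue

    # optionally stop the cluster marking restarting OSDs out
    ceph osd set noout

    # restart OSDs one host at a time, letting the cluster settle in between
    systemctl restart ceph-osd@<id>

    ceph osd unset noout

    # confirm the running OSD picked up the new queue, and check its mclock settings
    ceph daemon osd.N config show | grep -e osd_op_queue -e osd_mclock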