Just in case people don't know: osd_op_queue = "wpq" requires an OSD restart (a rough sketch of the switch-over commands is at the end of this message). And further to my theory about the spin lock or similar: increasing my recovery rate 4-16x using wpq sees my CPU rise to 10-15% (from 3%)... but using mclock, even at very, very conservative recovery settings, sees a median CPU usage of some multiple of 100% (i.e. a multiple of a full machine core/thread per OSD).

On Tue, Jul 19, 2022 at 4:18 PM Daniel Williams <danielwoz@xxxxxxxxx> wrote:

> Also, I never had problems with backfill / rebalance / recovery, but after upgrading from Pacific to Quincy I now see runaway CPU usage even with very conservative recovery settings:
>
> osd_recovery_sleep_hdd = 0.1
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_delay_start = 600
>
> Tried:
> osd_mclock_profile = "high_recovery_ops"
> It did not help.
>
> The CPU eventually runs away so much (regardless of config) that the OSD starts failing health checks, which causes even more problems, so I tried
> nodown,noout,noscrub,nodeep-scrub
> but none of that helped progress the recovery forward either.
>
> The only way back to a healthy cluster for now seems to be
> ceph osd set norebalance
>
> With rebalance toggled off, while the cluster slowly finishes the rebalances already in progress, I noticed that the whole cluster has almost no IO on the disks, except that on one of the hosts a single disk at 100% utilisation bounces around from disk to disk.
>
> Example from the host with the bouncing load:
> root@ceph-server-04:~# !dstat
> dstat -cd --disk-util --disk-tps --net
> ----total-usage---- -dsk/total- nvme-sdb--sda--sdc--sdd--sde--sdf--sdg--sdh--sdi--sdj--sdk- -dsk/total- -net/total-
> usr sys idl wai stl| read  writ|util:util:util:util:util:util:util:util:util:util:util:util|#read #writ| recv  send
>  74  12   9   3   0|2542k  246M|7.49:99.3:   0:   0:   0:27.2:   0:99.3:   0:   0:   0:   0|   9   636 |1251k  829k
>  75  11  10   3   0|  29M  254M|7.65: 101:   0:   0:74.1:20.1:   0: 101:   0:   0:   0:   0| 205   686 |4246k 7841k
>  61  26   9   3   0|6340k  250M|2.81: 101:   0:   0:12.9:   0:   0:99.7:   0:   0:   0:   0|  45   660 |  35M   35M
>  69  20   8   2   0|   0   243M|5.20:98.5:   0:   0:   0:   0:   0:99.7:   0:   0:   0:   0|   0   649 | 650k  442k
>  71  20   8   0   0|   0   150M|5.13:87.9:   0:   0:   0:   0:   0:68.2:   0:   0:   0:   0|   0   360 | 703k  443k
>  72  16  11  57   0|8168B   51M|5.18:   0:   0:   0:   0:   0:   0:1.99:   0:   0:86.5:   0|   2   129 | 702k  524k
>  72  16  11   1   0|   0  5865k|7.28:   0:   0:   0:   0:   0:   0:   0:   0:   0:90.6:   0|   0    36 |1578k 1184k
>  71  16  12   0   0|   0  6519k|7.25:   0:   0:   0:   0:   0:   0:   0:   0:   0: 112:   0|   0    38 | 904k  553k
>  75  11  11   2   0| 522k   32M|1.96:   0:   0:   0:1.96:   0:   0:   0:   0:   0:98.5:   0|   2    81 |1022k  847k
>  72  14  12   1   0|   0    60M|5.72:   0:   0:   0:   0:   0:   0:   0:   0:   0: 102:   0|   0   160 | 826k  550k
>  65  19  13   2   0|   0   124M|5.57:   0:   0:99.1:   0:   0:   0:   0:   0:   0:   0:   0|   0   339 | 648k  340k
>  69  17  11   2   0|   0   125M|2.82:   0:   0: 101:   0:   0:   0:   0:   0:   0:   0:   0|   0   333 | 694k  482k
>  75  15   9   1   0|   0   123M|3.56:   0:   0:99.3:   0:   0:   0:   0:   0:   0:   0:   0|   0   331 |1760k 1368k
>  79  10   9   1   0|   0   114M|2.01:   0:   0: 101:   0:   0:   0:   0:   0:   0:   0:   0|   0   335 | 893k  636k
>  77  14   8   0   0| 685k   72M|4.41:   0:   0:82.9:   0:   0:   0:   0:   0:1.20:   0:   0|   1   195 |1590k 1482k
>
> You can see that the "active" IO host is not doing much network traffic.
>
> The weird part is that the OSDs on the idle machines see huge CPU load even during periods of no IO. There are "some" explanations for that, since the cluster is entirely jerasure-coded HDDs with k=6, m=3, but it seems weird that such a small amount of data would be so CPU-intensive to recover when there is no performance degradation to client operations.
>
> My best guess is some sort of weird spin lock or equivalent, waiting on contended IO on the OSDs, due to a changed behaviour in responses for queued recovery operations?
>
> Setting just:
> osd_op_queue = "wpq"
> fixes my cluster; recovery going at the same speed now uses on average 3-6% CPU per OSD, down from 100-300%.
>
> On Tue, Jul 12, 2022 at 7:56 PM Sridhar Seshasayee <sseshasa@xxxxxxxxxx> wrote:
>
>> Hi Chris,
>>
>> While we look into this, I have a couple of questions:
>>
>> 1. Did the recovery rate stay at 1 object/sec throughout? In our tests we have seen that the rate is higher during the starting phase of recovery and eventually tapers off due to throttling by mclock.
>>
>> 2. Can you try speeding up the recovery by changing to the "high_recovery_ops" profile on all the OSDs, to see if it improves things (both CPU load and recovery rate)?
>>
>> 3. On the OSDs that showed high CPU usage, can you run the following command and report back? This just dumps the mclock settings on the OSDs.
>>
>>     sudo ceph daemon osd.N config show | grep osd_mclock
>>
>> I will update the tracker with these questions as well so that the discussion can continue there.
>>
>> Thanks,
>> -Sridhar
>>
>> On Tue, Jul 12, 2022 at 4:49 PM Chris Palmer <chris.palmer@xxxxxxxxx> wrote:
>>
>> > I've created tracker https://tracker.ceph.com/issues/56530 for this, including info on replicating it on another cluster.
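
For anyone who wants to try the same workaround, the switch back to wpq boils down to something like the commands below. This is only a sketch: it assumes the setting lives in the cluster's central config database and that the OSDs run under systemd; the restart step will differ for cephadm/containerised deployments, and osd.N / <id> are placeholders.

    # stage the change cluster-wide (only takes effect when each OSD restarts)
    ceph config set osd osd_op_queue wpq
    ceph config get osd osd_op_queue

    # optionally stop the cluster marking restarting OSDs out
    ceph osd set noout

    # restart OSDs one host at a time, letting the cluster settle in between
    systemctl restart ceph-osd@<id>

    ceph osd unset noout

    # confirm the running OSD picked up the new queue, and check its mclock settings
    ceph daemon osd.N config show | grep -e osd_op_queue -e osd_mclock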