Quincy: mClock config propagation does not work properly

Hi all,

While doing some tests on our lab cluster, running Quincy 17.1.0, we observed some strange behavior regarding the propagation of the mClock parameters to the OSDs. Basically, once the profile has been set to one of the pre-configured profiles and we then switch back to custom, changes to the individual mClock parameters are no longer propagated.

For more details, here is how we reproduce the issue on our lab:

********************** Step 1

We start the OSDs with the following configuration in place, as shown by `ceph config dump`:

```

osd advanced osd_mclock_profile custom
osd advanced osd_mclock_scheduler_background_recovery_lim 512
osd advanced osd_mclock_scheduler_background_recovery_res 128
osd advanced osd_mclock_scheduler_background_recovery_wgt 3
osd advanced osd_mclock_scheduler_client_lim 80
osd advanced osd_mclock_scheduler_client_res 30
osd advanced osd_mclock_scheduler_client_wgt 1
osd advanced osd_op_queue mclock_scheduler *
```

And we can observe that this is what the OSD is running, using `ceph daemon osd.X config show`:

```
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "1",
"osd_mclock_scheduler_background_best_effort_wgt": "1",
"osd_mclock_scheduler_background_recovery_lim": "512",
"osd_mclock_scheduler_background_recovery_res": "128",
"osd_mclock_scheduler_background_recovery_wgt": "3",
"osd_mclock_scheduler_client_lim": "80",
"osd_mclock_scheduler_client_res": "30",
"osd_mclock_scheduler_client_wgt": "1",
"osd_mclock_skip_benchmark": "false",
"osd_op_queue": "mclock_scheduler",
```
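
The full `config show` output is quite long, so it can be narrowed down to the relevant options with something like this (osd.0 is just an example id, assuming its admin socket is reachable on that host):

```
# keep only the mClock and op queue related settings; osd.0 is an example id
ceph daemon osd.0 config show | grep -E 'mclock|osd_op_queue'
```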

At this point, if we change something, the change is visible on the OSD. Let's say we change the background recovery reservation to 100:

`ceph config set osd osd_mclock_scheduler_background_recovery_res 100`

The change has been set properly on the OSDs:

```
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "1",
"osd_mclock_scheduler_background_best_effort_wgt": "1",
"osd_mclock_scheduler_background_recovery_lim": "512",
"osd_mclock_scheduler_background_recovery_res": "100",
"osd_mclock_scheduler_background_recovery_wgt": "3",
"osd_mclock_scheduler_client_lim": "80",
"osd_mclock_scheduler_client_res": "30",
"osd_mclock_scheduler_client_wgt": "1",
"osd_mclock_skip_benchmark": "false",
"osd_op_queue": "mclock_scheduler",
```
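
A single value can also be read back directly from a daemon through its admin socket, for instance (osd.0 again being an arbitrary example):

```
# read the running value straight from the daemon
ceph daemon osd.0 config get osd_mclock_scheduler_background_recovery_res
```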

********************** Step 2

We change the profile to high_recovery_ops and remove the old configuration:

```
ceph config set osd osd_mclock_profile high_recovery_ops
ceph config rm osd osd_mclock_scheduler_background_recovery_lim
ceph config rm osd osd_mclock_scheduler_background_recovery_res
ceph config rm osd osd_mclock_scheduler_background_recovery_wgt
ceph config rm osd osd_mclock_scheduler_client_lim
ceph config rm osd osd_mclock_scheduler_client_res
ceph config rm osd osd_mclock_scheduler_client_wgt
```
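
To double-check that an override has really been removed from the monitor config store, a per-option query can be used, e.g. (it should now return the default rather than our old override):

```
# should no longer return the removed override
ceph config get osd osd_mclock_scheduler_background_recovery_res
```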

The config contains this now:

```
osd advanced osd_mclock_profile high_recovery_ops
osd advanced osd_op_queue mclock_scheduler *
```

And we can see that the configuration was propagated to the OSDs:

```
"osd_mclock_profile": "high_recovery_ops",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "1",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "343",
"osd_mclock_scheduler_background_recovery_res": "103",
"osd_mclock_scheduler_background_recovery_wgt": "2",
"osd_mclock_scheduler_client_lim": "137",
"osd_mclock_scheduler_client_res": "51",
"osd_mclock_scheduler_client_wgt": "1",
"osd_mclock_skip_benchmark": "false",
"osd_op_queue": "mclock_scheduler",

```
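
If I remember correctly, the admin socket also has a diff view that shows which values a daemon is running that differ from the built-in defaults, something like (osd.0 again as an example):

```
# show the delta between the daemon's running config and the built-in defaults
ceph daemon osd.0 config diff
```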

********************** Step 3

The issue appears now, when we try to go back to the custom profile:

```
ceph config set osd osd_mclock_profile custom
ceph config set osd osd_mclock_scheduler_background_recovery_lim 512
ceph config set osd osd_mclock_scheduler_background_recovery_res 128
ceph config set osd osd_mclock_scheduler_background_recovery_wgt 3
ceph config set osd osd_mclock_scheduler_client_lim 80
ceph config set osd osd_mclock_scheduler_client_res 30
ceph config set osd osd_mclock_scheduler_client_wgt 1

```

The ceph configuration looks good:

```
osd advanced osd_mclock_profile custom
osd advanced osd_mclock_scheduler_background_recovery_lim 512
osd advanced osd_mclock_scheduler_background_recovery_res 128
osd advanced osd_mclock_scheduler_background_recovery_wgt 3
osd advanced osd_mclock_scheduler_client_lim 80
osd advanced osd_mclock_scheduler_client_res 30
osd advanced osd_mclock_scheduler_client_wgt 1
osd advanced osd_op_queue mclock_scheduler *
```

But the lim, res and wgt values on the OSDs still reflect the old high_recovery_ops profile, even though the profile is now custom:

```
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "1",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "343",
"osd_mclock_scheduler_background_recovery_res": "103",
"osd_mclock_scheduler_background_recovery_wgt": "2",
"osd_mclock_scheduler_client_lim": "137",
"osd_mclock_scheduler_client_res": "51",
"osd_mclock_scheduler_client_wgt": "1",
"osd_mclock_skip_benchmark": "false",
"osd_op_queue": "mclock_scheduler",
```
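
The mismatch is easy to see by comparing the value stored on the monitors with the value the daemon is actually running (osd.0 again being just an example):

```
# value in the monitor config database
ceph config get osd osd_mclock_scheduler_background_recovery_res           # 128
# value the daemon is actually using
ceph daemon osd.0 config get osd_mclock_scheduler_background_recovery_res  # still 103
```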

At this point, we can change whatever we want: the only change that has an effect on the mClock parameters is switching to a pre-configured profile, such as balanced, high_recovery_ops or high_client_ops. Setting custom again leaves the OSDs on their last pre-configured profile.

The only way I found to change these parameters is to restart the OSDs.
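
(For example, on a cephadm-managed cluster restarting a single OSD would look something like the command below; on non-cephadm deployments, restarting the ceph-osd@<id> systemd unit has the same effect.)

```
# restart one OSD daemon through the orchestrator (cephadm assumed)
ceph orch daemon restart osd.0
```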

This looks like a bug to me, but please tell me if this is expected behavior.

Regards,
Luis Domingues
Proton AG
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


