Sridhar,

Thanks a lot for this explanation. It's much clearer now. So at the end of the day (at least with the balanced profile) there is a lower bound, no upper limit, and a balanced distribution of IOPS between client and cluster operations.

Regards,
Frédéric.

-----Original message-----
From: Sridhar <sseshasa@xxxxxxxxxx>
To: Frédéric <frederic.nass@xxxxxxxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxx>
Sent: Wednesday, 10 January 2024 08:15 CET
Subject: Re: How does mclock work?

Hello Frédéric,

Please see answers below.

> Could someone please explain how mclock works regarding reads and writes?
> Does mclock intervene on both read and write IOPS? Or only on reads, or
> only on writes?

mClock schedules both read and write ops.

> And what type of underlying hardware performance is calculated and
> considered by mclock? It seems to be only write performance.

Random write performance is considered for setting the maximum IOPS capacity of an OSD. This, along with the sequential bandwidth capability of the OSD, is used to calculate the cost per IO that mClock uses internally when scheduling ops.

In addition, the mClock profiles use the capacity information to allocate reservations and limits for the different classes of service (e.g., client, background-recovery, scrub, snaptrim, etc.). The write performance is used to set a lower bound on the amount of bandwidth allocated to the different classes of service. For example, the 'balanced' profile allocates 50% of the OSD's IOPS capacity to client ops. In other words, a minimum of 50% of the OSD's bandwidth is guaranteed for client ops (read or write). If you look at the 'balanced' profile, there is no upper limit set for client ops (i.e., it is set to MAX), which means that reads can potentially use the maximum possible bandwidth (i.e., not constrained by the max IOPS capacity) if there are no other competing ops.

Please see https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/#built-in-profiles for more information about the mClock profiles.

> The mclock documentation shows HDD- and SSD-specific configuration options
> (capacity and sequential bandwidth) but nothing regarding hybrid setups, and
> these configuration options do not distinguish reads from writes. But read
> and write performance are often not on par for a single drive, and even less
> so in hybrid setups.
>
> With hybrid setups (RocksDB+WAL on SSDs or NVMes and data on HDD), if mclock
> only considers write performance, it may fail to properly schedule read IOPS
> (does mclock schedule read IOPS?) as the calculated IOPS capacity would be
> way too high for reads.
>
> With HDD-only setups (RocksDB+WAL+data on HDD), if mclock only considers
> write performance, the OSD may not take advantage of the higher read
> performance.
>
> Can someone please shed some light on this?

As mentioned above, as long as there are no competing ops, the mClock profiles ensure that nothing constrains client ops from using the full available bandwidth of an OSD for both reads and writes, regardless of the type of setup (hybrid, HDD-only, SSD) being used. The important aspect is to ensure that the configured IOPS capacity reflects a fairly accurate representation of the underlying device's capability, because the reservations based on that IOPS capacity are what maintain an acceptable level of performance for each class of service when other competing ops are active.

You could run some synthetic benchmarks with the default mClock profile to confirm that read and write performance are along the expected lines.

I hope this helps.
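
To make the arithmetic above a bit more concrete, here is a rough Python sketch of how the measured IOPS capacity and sequential bandwidth translate into a per-IO cost and into the 'balanced' profile's allocations. This is only an illustration based on the description above and the linked profile documentation, not Ceph source code; the IOPS and bandwidth numbers are placeholder examples, and the cost formula is a simplified assumption.

```python
# Illustrative sketch only -- simplified from the documented behaviour,
# not Ceph source code. Placeholder numbers for a single OSD.

iops_capacity = 315             # e.g. osd_mclock_max_capacity_iops_hdd (random 4 KiB writes)
seq_bandwidth = 150 * 1024**2   # e.g. osd_mclock_max_sequential_bandwidth_hdd, in bytes/s

# Simplified assumption: the baseline cost per IO relates the two measured
# values, i.e. how many "bandwidth bytes" one average IO is worth.
cost_per_io = seq_bandwidth / iops_capacity    # ~0.5 MiB per IO in this example

def op_cost(op_size_bytes: int) -> float:
    """Rough per-op cost: small ops are charged the baseline cost_per_io,
    large ops are dominated by their size (bandwidth)."""
    return max(op_size_bytes, cost_per_io)

# 'balanced' profile allocations (see the built-in profiles link above):
# reservation = guaranteed lower bound, limit = upper bound (MAX = unbounded).
MAX = float("inf")
balanced = {
    # service class          (reservation IOPS,     weight, limit)
    "client":                (0.5 * iops_capacity,  1,      MAX),
    "background_recovery":   (0.5 * iops_capacity,  1,      MAX),
}

for svc, (reservation, weight, limit) in balanced.items():
    print(f"{svc:22s} reservation={reservation:.0f} IOPS  weight={weight}  limit={limit}")

print(f"cost of a 4 KiB op : {op_cost(4 * 1024):.0f}")
print(f"cost of a 4 MiB op : {op_cost(4 * 1024**2):.0f}")
```

Because the client limit is MAX, reads and writes can use the full device capability whenever there are no competing background ops; the 50% reservation only matters under contention.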
-Sridhar