Re: ceph-osd iodepth for high-performance SSD OSDs

Frank Schilder <frans@xxxxxx> · Thu, 2 Dec 2021 16:35:58 +0000

Hi Szabo,

the sleep simply means that before executing the next recovery operation, the OSD waits for a certain amount of time so that any client ops that come in can be executed first. As far as I can tell, the sleep seems to include the execution time of the recovery op itself. The way I tune this is as follows:

- set osd norecover
- mark an OSD of a pool as out
- wait for peering to finish
- set osd_recovery_sleep for the relevant device class to 0
- unset osd norecover

At this point, the OSDs will start recovery at the max they can do. This gives the max baseline for GB/s, objects/s and keys/s that the pool can handle. Now I simply start increasing osd_recovery_sleep slowly until I get the required share for OPS, for example, 1/4 clients, 3/4 recovery.
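The steps above can be written down as a command sequence. The following is only a sketch: the function name, the OSD id and the device class are placeholders, and the `ceph config set osd/class:...` form assumes the centralized config database is in use, as the config dump later in this thread suggests. The script only prints the commands, it does not run them.

```python
from typing import List


def baseline_recovery_commands(osd_id: int, device_class: str) -> List[str]:
    """Return the ceph CLI commands for the sleep=0 baseline run, in order.

    osd_id and device_class are placeholders chosen by the operator.
    """
    return [
        # Hold back recovery while the OSD is marked out and PGs re-peer.
        "ceph osd set norecover",
        f"ceph osd out {osd_id}",
        # Wait for peering to finish (for example by watching `ceph status`)
        # before issuing the remaining two commands.
        f"ceph config set osd/class:{device_class} osd_recovery_sleep 0",
        # Recovery now starts at the maximum rate the OSDs can sustain.
        "ceph osd unset norecover",
    ]


for command in baseline_recovery_commands(osd_id=12, device_class="ssd"):
    print(command)
```

While recovery runs, the GB/s, objects/s and keys/s reported by the cluster give the baseline; afterwards the sleep value is raised again step by step.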

How do I get 1/4 client? I know the average client load on the pool. For us, this average load is small compared with the max IO of recovery (sleep=0). I just make sure that the sleep value allows for, say, 4*average load to proceed within the aggregated OPS budget of the pool.
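The budget reasoning above can be sketched as simple arithmetic. This assumes that the rate measured with sleep=0 can be treated as the pool's aggregated OPS budget; the function name and the numbers are made up for illustration and are not measurements from this cluster.

```python
def recovery_share(baseline_ops: float, avg_client_ops: float,
                   headroom: float = 4.0) -> float:
    """Fraction of the aggregated OPS budget that may go to recovery.

    baseline_ops   -- ops/s observed with osd_recovery_sleep = 0
    avg_client_ops -- average client load on the pool in ops/s
    headroom       -- multiple of the average client load to keep free
    """
    if baseline_ops <= 0:
        raise ValueError("baseline_ops must be positive")
    client_reserve = headroom * avg_client_ops
    if client_reserve >= baseline_ops:
        raise ValueError("client reserve exceeds the measured budget")
    return (baseline_ops - client_reserve) / baseline_ops


# Hypothetical numbers: 40k ops/s baseline, 2.5k ops/s average client load.
share = recovery_share(baseline_ops=40_000, avg_client_ops=2_500)
print(f"recovery may use {share:.0%} of the budget")  # prints 75%
```

The sleep is then increased until the observed recovery rate drops to roughly that share of the baseline.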

With these settings, average users don't notice a change in quality of service during recovery. Furthermore, the recovery period is really short, on the order of 5-10 minutes for an OSD failure.

Hope that helps.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 02 December 2021 11:03:29
To: Frank Schilder; Stefan Kooman; ceph-users
Subject: RE:  Re: ceph-osd iodepth for high-performance SSD OSDs

Hello,

Thank you for the details.

For me in objectstore these 2 values would be interesting:
osd        class:ssd      advanced osd_max_backfills  12
osd        class:ssd      advanced osd_recovery_sleep  0.002500

In the osd help I am always confused by this statement:
"desc": "Time in seconds to sleep before next recovery or backfill op",

But how does such a super low value help here? Does it break up long-lasting connections?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, December 1, 2021 8:59 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; Stefan Kooman <stefan@xxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re:  Re: ceph-osd iodepth for high-performance SSD OSDs

Email received from the internet. If in doubt, don't click any link nor open any attachment !
________________________________

The disks we use are KIOXIA PM6 read-intensive SAS NVMe SSDs and it is not possible to max them out with an effective queue depth of 4.

I should mention that I'm still using mimic. All the reports about crashing OSDs during recovery started with nautilus. I personally believe there is a regression with the WPQ scheduler that was reported a long time ago (https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/W4M5XQRDBLXFGJGDYZALG6TQ4QBVGGAJ). The symptoms many users reported starting with nautilus are exactly what I have seen with cut-off set to low on mimic. After setting cut-off to high I never had any problems with missed heartbeats, crashing OSDs or OSDs being marked down etc. during rebalance or snaptrim operations. All these problems seem to have reappeared with nautilus and I suspect that Robert's hypothesis has some foundation.

For comparison, some config values:

osd        class:hdd      advanced osd_max_backfills  3
osd        class:rbd_data advanced osd_max_backfills  6
osd        class:rbd_meta advanced osd_max_backfills  12
osd        class:rbd_perf advanced osd_max_backfills  12
osd        class:ssd      advanced osd_max_backfills  12

osd        class:hdd      advanced osd_recovery_sleep  0.050000
osd        class:rbd_data advanced osd_recovery_sleep  0.025000
osd        class:rbd_meta advanced osd_recovery_sleep  0.002500
osd        class:rbd_perf advanced osd_recovery_sleep  0.002500
osd        class:ssd      advanced osd_recovery_sleep  0.002500

The rbd_perf disks are the high-performance disks; all rbd_* disks are SSDs. The sleep values for SSDs are tuned so that recovery gets at most 75% of the aggregated IOPS of the SSD pools. With these settings, I see very high recovery rates while user IO is not measurably affected. The objects of a failed SSD are rebuilt within a few minutes.
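To compare such per-class values side by side, the dump lines above can be read into a small lookup table. This is a minimal sketch that assumes the five-column layout shown here (who, mask, level, option, value); the function name is made up and real `ceph config dump` output may carry additional columns.

```python
from collections import defaultdict
from typing import Dict

# The values quoted above, copied verbatim.
CONFIG_DUMP = """\
osd        class:hdd      advanced osd_max_backfills  3
osd        class:rbd_data advanced osd_max_backfills  6
osd        class:rbd_meta advanced osd_max_backfills  12
osd        class:rbd_perf advanced osd_max_backfills  12
osd        class:ssd      advanced osd_max_backfills  12
osd        class:hdd      advanced osd_recovery_sleep  0.050000
osd        class:rbd_data advanced osd_recovery_sleep  0.025000
osd        class:rbd_meta advanced osd_recovery_sleep  0.002500
osd        class:rbd_perf advanced osd_recovery_sleep  0.002500
osd        class:ssd      advanced osd_recovery_sleep  0.002500
"""


def parse_class_settings(text: str) -> Dict[str, Dict[str, float]]:
    """Map device class -> {option: value} for lines with a class: mask."""
    settings: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that does not match the assumed five-column layout.
        if len(fields) != 5 or not fields[1].startswith("class:"):
            continue
        _who, mask, _level, option, value = fields
        settings[mask[len("class:"):]][option] = float(value)
    return dict(settings)


for device_class, options in sorted(parse_class_settings(CONFIG_DUMP).items()):
    print(device_class, options)
```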

If you have WPQ cut-off set to high and your OSDs lose heartbeats or crash while rebalancing with the above config values, there is a problem that I don't (yet) have.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 01 December 2021 14:26:12
To: Frank Schilder; Stefan Kooman; ceph-users
Subject: RE:  Re: ceph-osd iodepth for high-performance SSD OSDs

Are these nvme or ssd drives?
How stable is it when a rebalance or degraded recovery kicks off? Can it handle that without crashing OSDs?
The reason I'm asking: in my cluster I can see that during recovery a disk can already be maxed out with 1 OSD per disk, so how would it be with 4?
I'm doing the same test at the moment (4 OSDs on the SSD and NVMe drives, also in my objectstore cluster); I just haven't reached the billions of objects/bucket yet, at which point I want to destroy 1 host.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, December 1, 2021 5:20 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; Stefan Kooman <stefan@xxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re:  Re: ceph-osd iodepth for high-performance SSD OSDs

Hi Szabo,

no, I didn't. I deployed 4 OSDs per drive and get maybe 25-50% of their performance out. The kv_sync thread is the bottleneck.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 01 December 2021 09:00:02
To: Frank Schilder; Stefan Kooman; ceph-users
Subject: RE:  Re: ceph-osd iodepth for high-performance SSD OSDs

Hi Frank,

Have you found any tuning for bstore_kv_sync that helps without putting more OSDs on each disk?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Tuesday, October 26, 2021 4:44 PM
To: Stefan Kooman <stefan@xxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject:  Re: ceph-osd iodepth for high-performance SSD OSDs

Hi Stefan,

thanks a lot for this information. I increased osd_op_num_threads_per_shard with little effect (I did restart and checked with config show that the value is applied). I'm afraid I'm bound by the bstore_kv_sync thread, as explained in this thread, which discusses it in great detail (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033522.html). There was also a discussion of giving every shard its own kv store to increase concurrency on the kv store(s), but it seems this was not implemented, at least not in mimic. I'm afraid with my disks I get an effective queue depth of 1-2 per active bstore_kv_sync thread (meaning: per OSD daemon), which more or less matches the aggregated performance I see.
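The link between effective queue depth and aggregate performance can be sketched with Little's law (throughput = requests in flight / latency per request). The function name and the 100 microsecond latency below are purely hypothetical and were not measured on these disks; only the queue depth of 1-2 per OSD daemon comes from the paragraph above.

```python
def estimated_iops(osd_daemons: int, queue_depth_per_osd: float,
                   latency_us: float) -> float:
    """Little's law: IOPS = requests in flight / latency per request."""
    if latency_us <= 0:
        raise ValueError("latency_us must be positive")
    in_flight = osd_daemons * queue_depth_per_osd
    return in_flight * 1_000_000 / latency_us


# Hypothetical device latency of 100 microseconds per synchronous write.
for daemons in (1, 4):
    for depth in (1, 2):
        iops = estimated_iops(daemons, depth, latency_us=100)
        print(f"{daemons} OSD daemon(s), queue depth {depth}: {iops:,.0f} IOPS")
```

Under this model the only way to raise throughput on a given disk is to keep more requests in flight, which is why adding OSD daemons per disk helps when each daemon has a single kv sync thread.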

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 26 October 2021 11:14:28
To: Frank Schilder; ceph-users
Subject: Re:  Re: ceph-osd iodepth for high-performance SSD OSDs

On 10/26/21 10:22, Frank Schilder wrote:
> It looks like the bottleneck is the bstore_kv_sync thread, there seems to be only one running per OSD daemon independent of shard number. This would imply a rather low effective queue depth per OSD daemon. Are there ways to improve this other than deploying even more OSD daemons per OSD?

Regarding num op threads, see Slide 23 of [1]:

- osd_op_num_threads_per_shard * osd_op_num_shards
- Keep at number_of_threads_of_your_cpu_can_handle - async_msgr_op_threads - 3..5
  - For example, "osd op num shards = 8", "osd op num threads per shard = 2" and "ms async op threads = 3" for 22-core CPU with HT/SMT (2*8+3 = 19, 3 threads left for Bluestore, RocksDB, etc.)
- Increase in case of slower NVMe to improve IOPS and latency in random reads/writes
  - Offset NVMe processing time (iowait) by using other thread to do CPU-consuming work in the meantime
- Don't set too high, or context switches will kill your performance
- Too low values will cause your OSDs to stall
- Change requires OSD restart
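The arithmetic in the slide's example can be checked with a short sketch. The function name is made up, and the value 22 is taken as the number of hardware threads only because that is what the slide's own sum (2*8+3 = 19, 3 left) implies.

```python
def spare_threads(hw_threads: int, num_shards: int,
                  threads_per_shard: int, msgr_threads: int) -> int:
    """Hardware threads left after OSD op threads and messenger threads."""
    op_threads = num_shards * threads_per_shard
    return hw_threads - (op_threads + msgr_threads)


# Values from the slide: 8 shards, 2 threads per shard, 3 async op threads.
left = spare_threads(hw_threads=22, num_shards=8,
                     threads_per_shard=2, msgr_threads=3)
print(f"{left} threads left for Bluestore, RocksDB, etc.")  # 3

# The slide recommends keeping 3..5 threads free.
print("within recommended reserve:", 3 <= left <= 5)
```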

Gr. Stefan

[1]:
https://static.sched.com/hosted_files/cephalocon2019/d6/ceph%20on%20nvme%20barcelona%202019.pdf
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx