Are these NVMe or SSD drives? How stable is it when a rebalance or degradation recovery kicks off? Can it handle that without crashing OSDs?

I'm asking because in my cluster a single OSD per disk can already max out the drive during recovery, so how can it cope with 4? I'm running the same test at the moment (4 OSDs on the SSD and NVMe drives, also in my object store cluster); I just haven't yet reached the billions of objects per bucket at which point I want to destroy 1 host.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, December 1, 2021 5:20 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; Stefan Kooman <stefan@xxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Re: ceph-osd iodepth for high-performance SSD OSDs

Email received from the internet. If in doubt, don't click any link nor open any attachment !
________________________________

Hi Szabo,

no, I didn't. I deployed 4 OSDs per drive and get maybe 25-50% of their performance out. The kv_sync thread is the bottleneck.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 01 December 2021 09:00:02
To: Frank Schilder; Stefan Kooman; ceph-users
Subject: RE: Re: ceph-osd iodepth for high-performance SSD OSDs

Hi Frank,

Have you found any tuning for bstore_kv_sync that avoids putting more OSDs on each disk?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Tuesday, October 26, 2021 4:44 PM
To: Stefan Kooman <stefan@xxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: ceph-osd iodepth for high-performance SSD OSDs

________________________________

Hi Stefan,

thanks a lot for this information. I increased osd_op_num_threads_per_shard with little effect (I did restart and checked with config show that the value is applied). I'm afraid I'm bound by bstore_kv_sync, as explained in this thread, which discusses it in great detail (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033522.html). There was also a discussion of giving every shard its own kv store to increase concurrency on the kv store(s), but it seems this was not implemented, at least not in mimic. I'm afraid that with my disks I get an effective queue depth of 1-2 per active bstore_kv_sync thread (meaning: per OSD daemon), which more or less matches the aggregated performance I see.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 26 October 2021 11:14:28
To: Frank Schilder; ceph-users
Subject: Re: Re: ceph-osd iodepth for high-performance SSD OSDs

On 10/26/21 10:22, Frank Schilder wrote:
> It looks like the bottleneck is the bstore_kv_sync thread, there seems to be only one running per OSD daemon independent of shard number. This would imply a rather low effective queue depth per OSD daemon. Are there ways to improve this other than deploying even more OSD daemons per OSD?

Regarding num op threads, see Slide 23 of [1]:

. osd_op_num_threads_per_shard * osd_op_num_shards
. Keep at number_of_threads_of_your_cpu_can_handle - async_msgr_op_threads - 3..5
  - For example, "osd op num shards = 8", "osd op num threads per shard = 2" and "ms async op threads = 3" for 22-core CPU with HT/SMT (2*8+3 = 19, 3 threads left for Bluestore, RocksDB, etc.)
. Increase in case of slower NVMe to improve IOPS and latency in random reads/writes
  - Offset NVMe processing time (iowait) by using other thread to do CPU-consuming work in the meantime
. Don't set too high, or context switches will kill your performance
. Too low values will cause your OSDs to stall
. Change requires OSD restart

Gr. Stefan

[1]: https://static.sched.com/hosted_files/cephalocon2019/d6/ceph%20on%20nvme%20barcelona%202019.pdf

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
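
The thread-budget arithmetic from Slide 23 quoted above (2*8+3 = 19, 3 threads left) can be checked with a few lines of Python. This is only a sketch of that arithmetic: the function names are hypothetical, 22 is taken as the usable hardware thread count as the slide's example does, and the default reserve of 3 is the lower end of the slide's "3..5". It does not model multiple OSD daemons sharing one host.

```python
def osd_thread_budget(cpu_threads, op_shards, threads_per_shard, msgr_threads):
    """Return (threads used by op workers plus messenger, threads left over).

    op_shards         ~ "osd op num shards"
    threads_per_shard ~ "osd op num threads per shard"
    msgr_threads      ~ "ms async op threads"
    """
    used = op_shards * threads_per_shard + msgr_threads
    return used, cpu_threads - used


def within_guideline(cpu_threads, op_shards, threads_per_shard,
                     msgr_threads, reserve=3):
    """True if at least `reserve` threads remain for BlueStore, RocksDB, etc.

    `reserve` is an assumption: the slide suggests leaving 3 to 5 threads.
    """
    _, left = osd_thread_budget(cpu_threads, op_shards,
                                threads_per_shard, msgr_threads)
    return left >= reserve


# Values from the slide's example.
used, left = osd_thread_budget(cpu_threads=22, op_shards=8,
                               threads_per_shard=2, msgr_threads=3)
print(used, left)  # 19 3
```

With the slide's numbers the configuration just meets a reserve of 3 threads; asking for the upper end of the range (reserve=5) would require lowering one of the three settings.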