Re: ceph-osd iodepth for high-performance SSD OSDs

Performance tests I did with recent SAS SSD drives indicate that these require a very high degree of concurrency to get close to spec performance. I agree that with standard data SSD drives, 1-2 OSD daemons are enough. With high-performance drives it is a different story: pushing a reasonable number of IOs down to the drive requires heavy concurrency.
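
For anyone wanting to reproduce this, a minimal fio sketch (device path and numbers are illustrative, and note that a raw-device write test like this destroys the data on the target):

    fio --name=concurrency-test --filename=/dev/sdX --direct=1 \
        --ioengine=libaio --rw=randwrite --bs=4k \
        --iodepth=128 --numjobs=4 --runtime=60 --time_based \
        --group_reporting

Sweeping --iodepth from 1 upwards shows at what queue depth the drive approaches its spec IOPS.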

The ceph-osd daemons just about manage to use 350% CPU each, and it looks like the bstore_kv_sync thread (which forces IO to be serialised to some degree) is the bottleneck. However, this sync thread itself only uses 50% CPU, so I would expect that some tweaking is still possible here.
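
Per-thread CPU figures like these can be observed with, for example (OSD pid as a placeholder):

    top -H -p <osd-pid>

where each thread shows up with its own %CPU and its name (bstore_kv_sync among them) in the COMMAND column.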

PCIe NVMes are different. Tests I did with expensive ones show that already with iodepth=1 one can achieve full bus bandwidth (the spec'ed GT/s) with random 4K IO. Here, concurrency is *not* required by the drive itself; it just helps to reduce the impact of ceph's latency on aggregated performance. For PCIe NVMe drives I would expect the bstore_kv_sync thread to be CPU bound.
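
The corresponding iodepth=1 measurement is just the same kind of fio run with a single outstanding IO, for example:

    fio --name=qd1-test --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --rw=randread --bs=4k \
        --iodepth=1 --numjobs=1 --runtime=60 --time_based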

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 26 October 2021 10:53:10
To: Frank Schilder
Cc: ceph-users
Subject: Re:  Re: ceph-osd iodepth for high-performance SSD OSDs

Isn't 4 OSDs per SSD too much? Normally it is NVMe that is suitable for 4 OSDs, isn't it?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

On 2021. Oct 26., at 10:23, Frank Schilder <frans@xxxxxx> wrote:


It looks like the bottleneck is the bstore_kv_sync thread; there seems to be only one running per OSD daemon, independent of the number of shards. This would imply a rather low effective queue depth per OSD daemon. Are there ways to improve this other than deploying even more OSD daemons per drive?
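
That there is only one such thread can be verified by listing the daemon's threads by name, for example (pid as a placeholder):

    ps -T -p <osd-pid> -o spid,pcpu,comm | grep bstore_kv

ps -T prints one line per thread with its name and CPU usage.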

Thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 26 October 2021 09:41:44
To: ceph-users
Subject:  ceph-osd iodepth for high-performance SSD OSDs

Hi all,

we deployed a pool with high-performance SSDs and I'm testing aggregated performance. We seem to hit a bottleneck that is not caused by drive performance. My best guess at the moment is that the effective iodepth of the OSD daemons is too low for these drives. I have 4 OSDs per drive and I vaguely remember that there are parameters to modify the degree of concurrency an OSD daemon uses to write to disk. Are these parameters the ones I'm looking for:

   "osd_op_num_shards": "0",
   "osd_op_num_shards_hdd": "5",
   "osd_op_num_shards_ssd": "8",
   "osd_op_num_threads_per_shard": "0",
   "osd_op_num_threads_per_shard_hdd": "1",
   "osd_op_num_threads_per_shard_ssd": "2",

How do these apply if I have these drives in a custom device class rbd_perf? Could I set, for example

ceph config set osd/class:rbd_perf osd_op_num_threads_per_shard 4

to increase concurrency on this particular device class only? Is it possible to increase the number of shards at run-time?
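
As a side note on the run-time question: as far as I know, ceph can report whether an option is changeable at runtime, e.g.

    ceph config help osd_op_num_shards

which prints, among other things, whether the option can be updated at runtime.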

Thanks for your help!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx