Re: Performance impact of Heterogeneous environment

Mark Nelson <mark.nelson@xxxxxxxxx> · Fri, 19 Jan 2024 07:45:12 -0600

On 1/18/24 03:40, Frank Schilder wrote:

For the multi- vs. single-OSD per flash drive decision, the following might be useful:

We found dramatic improvements using multiple OSDs per flash drive with Octopus *if* the bottleneck is the kv_sync_thread. Apparently, each OSD has only one, and when saturated this thread effectively serializes otherwise asynchronous IO.

There was a dev discussion about having more kv_sync_threads per OSD daemon by splitting up the RocksDB instances per PG, but I don't know if this ever materialized.

I advocated for it at one point, but Sage was pretty concerned about how 
much we'd be disturbing Bluestore's write path. Instead, Adam ended up 
implementing the column family sharding inside RocksDB which got us some 
(but not all) of the benefit.

A lot of the work that has gone into refactoring the RocksDB settings in 
Reef has been to help mitigate some of the overhead in the kv sync 
thread.  The gist of it is that we are trying to balance keeping the 
memtables large enough that short-lived items like pg log
entries don't regularly leak into the DB, while simultaneously keeping
the memtables as small as possible to reduce the number of comparisons 
RocksDB needs to do to keep them in sorted order during inserts (which 
is a significant overhead in the kv sync thread during heavy small 
random writes).  This is also part of the reason that Igor was 
experimenting with implementing a native bluestore WAL rather than 
relying on the one in RocksDB.
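
To make the memtable trade-off above a bit more concrete, here is a rough Python sketch. It is not RocksDB code: it uses binary-search insertion into a sorted list as a stand-in for a sorted memtable and only counts key comparisons per insert. The names CountingKey and avg_comparisons_per_insert are made up for this illustration, and the entry counts are arbitrary.

```python
import bisect
import random


class CountingKey:
    """Wraps a key and counts every '<' comparison made with it."""

    comparisons = 0  # shared counter, reset before each run

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        CountingKey.comparisons += 1
        return self.value < other.value


def avg_comparisons_per_insert(n_entries, seed=0):
    """Insert n_entries random keys in sorted order; return comparisons per insert."""
    rng = random.Random(seed)
    CountingKey.comparisons = 0
    table = []  # stand-in for a sorted memtable (assumption, not RocksDB's structure)
    for _ in range(n_entries):
        bisect.insort(table, CountingKey(rng.random()))
    return CountingKey.comparisons / n_entries


for n in (64, 1024, 16384):
    print(n, "entries:", round(avg_comparisons_per_insert(n), 1), "comparisons/insert")
```

The per-insert cost grows roughly with the logarithm of the number of entries, which is why a smaller memtable is cheaper to insert into but flushes short-lived entries sooner.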

My guess is that for good NVMe drives it is possible that a single kv_sync_thread can saturate the device, and there will be no advantage to having more OSDs per device. On lesser drives (SATA/SAS flash) multi-OSD deployments are usually better, because the on-disk controller requires concurrency to saturate the drive. It's not possible to saturate typical SAS/SATA SSDs with iodepth=1.
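
As a back-of-the-envelope illustration of the iodepth=1 point, Little's law gives an upper bound on IOPS of outstanding requests divided by per-request latency. The sketch below assumes latency stays constant as queue depth rises (it does not on real drives), and the latency figures are made-up placeholders, not measurements. max_iops is a hypothetical helper name.

```python
def max_iops(iodepth, latency_us):
    """Little's law upper bound: outstanding requests / per-request latency."""
    return iodepth * 1_000_000 / latency_us


# Illustrative latencies only (assumptions, not measured values).
drives = (("SATA/SAS SSD, assumed 100 us", 100), ("NVMe, assumed 20 us", 20))

for name, latency_us in drives:
    for qd in (1, 8, 32):
        print(f"{name}: iodepth={qd} -> at most {max_iops(qd, latency_us):,.0f} IOPS")
```

With a single outstanding request the ceiling is simply 1/latency, so a drive whose controller needs many requests in flight cannot be driven to its rated IOPS that way.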

Oddly enough, we do see some effect on small random reads.  The way that 
the async msgr / shards / threads interact doesn't scale well past about 
14-16 CPU threads (increasing the shard/thread counts has
complicated effects and may not always help).  If you look at that 1 vs
2 NVMe article on the ceph.io page, you'll see that once you hit the 14 
CPU threads the 2 OSD/NVMe configuration keeps scaling but the single 
OSD configuration tops out.

With good NVMe drives I have seen fio tests with direct IO saturate the drive with 4K random IO and iodepth=1. You need enough PCIe lanes per drive for that, and I could imagine that here 1 OSD/drive is sufficient. For such drives, storage access quickly becomes CPU bound, so some benchmarking that takes all system properties into account is required. If you are already CPU bound (too many NVMe drives per core; many standard servers with 24+ NVMe drives have that property), there is no point adding extra CPU load with more OSD daemons.
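
For anyone wanting to repeat that kind of test, a minimal sketch of the fio invocation, built from Python so the queue depth can be varied. It only constructs and prints the command; it does not run it. The device path is a placeholder, and I chose randread so the example is non-destructive; the original test only says "4K random IO". fio_randread_cmd is a hypothetical helper name.

```python
import shlex


def fio_randread_cmd(device, iodepth=1, runtime_s=60):
    """Build (but do not run) a fio command for 4K direct random reads."""
    return [
        "fio",
        f"--name=randread-qd{iodepth}",
        f"--filename={device}",   # placeholder device path (assumption)
        "--rw=randread",          # read-only, so the drive contents are untouched
        "--bs=4k",
        "--direct=1",             # bypass the page cache
        "--ioengine=libaio",
        f"--iodepth={iodepth}",
        "--numjobs=1",
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]


for qd in (1, 32):
    print(shlex.join(fio_randread_cmd("/dev/nvme0n1", iodepth=qd)))
```

Comparing the iodepth=1 result against a deeper queue on the same drive shows whether the device needs concurrency to reach its limit.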

Don't just look at single disks, look at the whole system.
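
A quick way to sanity-check the whole-system view is to work out how many CPU threads each OSD would actually get. The sketch below is plain arithmetic; the host sizes are hypothetical examples, and threads_per_osd is a made-up helper name.

```python
def threads_per_osd(cpu_threads, nvme_drives, osds_per_drive=1):
    """CPU threads available per OSD daemon on one host."""
    return cpu_threads / (nvme_drives * osds_per_drive)


# Hypothetical hosts: (CPU threads, NVMe drives).
for cpu_threads, drives in ((64, 24), (64, 8)):
    for osds_per_drive in (1, 2):
        share = threads_per_osd(cpu_threads, drives, osds_per_drive)
        print(f"{cpu_threads} threads, {drives} drives, "
              f"{osds_per_drive} OSD/drive: {share:.2f} threads per OSD")
```

If the per-OSD share is already small with one OSD per drive, doubling the OSD count halves it again.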

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Bailey Allison <ballison@xxxxxxxxxxxx>
Sent: Thursday, January 18, 2024 12:36 AM
To: ceph-users@xxxxxxx
Subject:  Re: Performance impact of Heterogeneous environment

+1 to this, great article and great research. Something we've been keeping a very close eye on ourselves.

Overall we've mostly settled on the old keep-it-simple-stupid methodology with good results, and have been rocking single OSD/NVMe, especially as the benefits have shrunk with more recent Ceph versions. But as always, everything is workload dependent and there is sometimes a need for doubling up 😊

Regards,

Bailey

-----Original Message-----
From: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
Sent: January 17, 2024 4:59 PM
To: Mark Nelson <mark.nelson@xxxxxxxxx>; ceph-users@xxxxxxx
Subject:  Re: Performance impact of Heterogeneous environment

Very informative article you did Mark.

IMHO if you find yourself with a very high per-OSD core count, it may be logical
to just pack/add more NVMe drives per host; you'd be getting the best price per
performance and capacity.

/Maged

On 17/01/2024 22:00, Mark Nelson wrote:
It's a little tricky.  In the upstream lab we don't strictly see an
IOPS or average latency advantage with heavy parallelism by running
multiple OSDs per NVMe drive until per-OSD core counts get very high.
There does seem to be a fairly consistent tail latency advantage even
at moderately low core counts however.  Results are here:

https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/

Specifically for jitter, there is probably an advantage to using 2
cores per OSD unless you are very CPU starved, but how much that
actually helps in practice for a typical production workload is
questionable imho.  You do pay some overhead for running 2 OSDs per
NVMe as well.

Mark

On 1/17/24 12:24, Anthony D'Atri wrote:
Conventional wisdom is that with recent Ceph releases there is no
longer a clear advantage to this.

On Jan 17, 2024, at 11:56, Peter Sabaini <peter@xxxxxxxxxx> wrote:

One thing that I've heard people do but haven't done personally with
fast NVMes (not familiar with the IronWolf so not sure if they
qualify) is partition them up so that they run more than one OSD
(say 2 to 4) on a single NVMe to better utilize the NVMe bandwidth.
See
https://ceph.com/community/bluestore-default-vs-tuned-performance-comparison/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
