Re: 4k IOPS: miserable performance in All-SSD cluster

In my experience, Ceph will add around 1 ms even if everything runs on localhost. Whether this happens in the client code or on the OSDs I don't really know, but the latency is there nevertheless. Perhaps the reason can be found among the tradeoffs Ceph and similar systems have to make to ensure consistency even when a network partition can happen at any time:

https://en.wikipedia.org/wiki/PACELC_theorem

With size=3, a write first goes to the primary OSD for the PG (0.1 ms), from there to the two replica OSDs in parallel (0.2 ms more round trip), and then the acknowledgement goes back to the client (0.1 ms). Add Ceph's ~1 ms internal overhead and you are at very roughly 1.4 ms per write, even if storage latency were 0, which it never is, not even for SSDs.
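Spelled out as a back-of-the-envelope calculation (all numbers are rough assumptions, not measurements from your cluster):

    client -> primary OSD:             ~0.1 ms
    primary -> 2 replicas -> primary:  ~0.2 ms (both replicas in parallel)
    primary -> client (ack):           ~0.1 ms
    Ceph internal overhead:            ~1.0 ms
    ------------------------------------------
    per 4k write at qd=1:              ~1.4 ms  => at most ~700 IOPS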

If you set size=1, you skip the step where the primary OSD replicates to the two replicas, but you still have Ceph's internal latency as well as the network latency to reach the primary OSD of whatever PG the object lands in, which could be on any server. So expect a small improvement, but not too much.
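With the same rough assumptions, size=1 only removes the 0.2 ms replication step: ~0.1 + 0.1 + 1.0 ≈ 1.2 ms per write, i.e. maybe ~830 IOPS instead of ~700. That matches "a small improvement but not too much".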

With that said, a single thread with one operation in flight will never exceed roughly 1,000 IOPS in a typical setup.
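If you want to see the concurrency effect directly, something like this should do it (the pool name "testpool" is just a placeholder; -t is the number of operations rados bench keeps in flight):

    # 4k writes, one operation in flight (roughly what you measured)
    rados bench -p testpool 30 write -b 4096 -t 1 --no-cleanup

    # same test with 32 operations in flight
    rados bench -p testpool 30 write -b 4096 -t 32 --no-cleanup

    # remove the benchmark objects afterwards
    rados -p testpool cleanup

The per-operation latency stays the same, but with 32 operations in flight you should see roughly 32 times the IOPS, until the OSDs or the network become the bottleneck.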

/Peter


On 2024-11-26 at 21:09, Martin Gerhard Loschwitz wrote:
That would mean 2-3 ms of latency between hosts sitting right above each other in the same rack and connected to the same switches.

Ping shows 0.2 ms of latency, though, for all three affected clusters.

So roughly 5,000 IOPS. We can certainly add Ceph latency to that, but that would mean Ceph eats 99% of the available performance, wouldn't it?

Also, that wouldn't explain why we're seeing only a slight improvement with size=1 for a specific pool rather than a massive one, given that at least half of the latency is taken out of the equation in that case.

Best regards
Martin

Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote on Tue, 26 Nov 2024 at 20:52:

    With qd=1 (queue depth?) and a single thread, this isn't totally
    unreasonable.

    Ceph will have an internal latency of around 1 ms or so; add some
    network latency to that and an operation can take 2-3 ms. With a
    single operation in flight all the time, this means 333-500
    operations per second. With HDDs, even fewer.

    What happens if you try again with many more threads?


    On 2024-11-25 at 15:22, Martin Gerhard Loschwitz wrote:
    > Folks,
    >
    > I am getting somewhat desperate debugging multiple setups here
    > within the same environment. Three clusters, two SSD-only, one
    > HDD-only, and what they all have in common is abysmal 4k IOPS
    > performance when measuring with "rados bench". Abysmal means: in
    > an all-SSD cluster I will get roughly 400 IOPS over more than 250
    > devices. I know SAS SSDs are not ideal, but 400 IOPS looks a bit
    > on the low side of things to me.
    >
    > In the second cluster, also all-SSD based, I get roughly 120 4k
    > IOPS. And the HDD-only cluster delivers 60 4k IOPS. The latter
    > both with substantially fewer devices, granted. But even with 20
    > HDDs, 68 4k IOPS seems like a very bad value to me.
    >
    > I've tried to rule out everything I know of: BIOS
    > misconfigurations, HBA problems, networking trouble (I am seeing
    > comparably bad values with a size=1 pool) and so on and so forth.
    > But to no avail. Has anybody dealt with something similar on
    > Dell hardware or in general? What could cause such extremely bad
    > benchmark results?
    >
    > I measure with rados bench and qd=1 at 4k block size. "ceph tell
    > osd bench" with 4k blocks yields 30k+ IOPS for every single device
    > in the big cluster, and all that leads to is 400 IOPS in total
    > when writing to it? Even with no replication in place? That looks
    > a bit off, doesn't it? Any help will be greatly appreciated, and
    > even a pointer in the right direction would be held in high
    > esteem right now. Thank you very much in advance!
    >
    > Best regards
    > Martin

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
