Re: 4k IOPS: miserable performance in All-SSD cluster


 



Martin, are MONs set up on the same hosts, or is there latency to them by
any chance?
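
A quick way to compare MON round-trip latency with plain network latency, as a rough sketch (the hostnames are placeholders; "ceph ping" needs client admin access):

  ceph ping mon.*        # round trip from this client to every monitor in the quorum
  ping -c 10 mon-host-1  # raw ICMP latency to the MON hosts for comparison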
--
Alex Gorbachev
https://alextelescope.blogspot.com



On Tue, Nov 26, 2024 at 5:20 AM Martin Gerhard Loschwitz <
martin.loschwitz@xxxxxxxxxxxxx> wrote:

> Hi Alex,
>
> thank you for the reply. Here are all the steps we’ve taken over the last
> weeks to reduce complexity (we’re focusing on the HDD cluster for now, as
> it shows the relatively worst results, and it also happens to be the
> simplest setup network-wise, even though there is only a 1G link between
> the nodes).
>
> * measure IOPS per physical device (results were within expectations for
> HDDs; a run along these lines is sketched below)
> * reinstall the OS, reset the BIOS, reset the HBA configuration (or
> rather, switch the Dell PERC to HBA mode)
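>
> A minimal fio sketch of such a per-device 4k random-write test at queue
> depth 1 (the device path is a placeholder, and writing to a raw device
> destroys its contents, so only run this against an empty spare disk):
>
>   fio --name=4k-qd1-randwrite --filename=/dev/sdX \
>       --ioengine=libaio --direct=1 --fsync=1 \
>       --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
>       --runtime=60 --time_based --group_reporting
>
> The --fsync=1 flag roughly approximates the flush-heavy write pattern an
> OSD generates, so raw-device numbers with it tend to be closer to what
> Ceph can realistically deliver per disk.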
>
> The current setup is Ubuntu 24.04 with Linux 6.5. That yields better
> results than 20.04 with a 5.x kernel and Ceph 17 (65 vs. 41 IOPS), but all
> of it is still terrible.
>
> We’re also not seeing anything obvious in iostat. Latency is normal LAN
> latency with no packet loss, and MTU 1500 vs. MTU 9000 makes literally no
> difference.
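>
> As a side note for anyone reproducing this: a simple way to confirm that a
> given MTU really passes end to end is a do-not-fragment ping sized to it
> (the peer name is a placeholder; 8972 = 9000 minus 28 bytes of IP/ICMP
> headers, 1472 = 1500 minus 28):
>
>   ping -M do -s 8972 -c 5 osd-node-2   # MTU 9000 path check
>   ping -M do -s 1472 -c 5 osd-node-2   # MTU 1500 path check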
>
> When we disable replication in that setup (pool size=1), we get about 90
> IOPS from the same pool, and there is no special network configuration in
> place. I am attaching a dump of the historic ops of an example OSD in the
> cluster for reference; maybe somebody sees something obvious in there.
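>
> For completeness, the kind of commands involved, as a sketch (pool name
> and OSD id are placeholders; size=1 should only ever be set on a
> throwaway test pool, and dump_historic_ops has to run on the host that
> carries the OSD):
>
>   ceph config set global mon_allow_pool_size_one true
>   ceph osd pool create bench-test 128 128
>   ceph osd pool set bench-test size 1 --yes-i-really-mean-it
>
>   ceph daemon osd.12 dump_historic_ops > osd12-historic-ops.json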
>
> Best regards
> Martin
>
>
>
> On 26.11.2024 at 03:43, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Martin,
>
> This is a bit of a generic recommendation, but I would go down the path of
> reducing complexity, i.e. first test the drive locally on the OSD node and
> see if there’s anything going on with e.g. drive firmware, cables, HBA, or
> power.
>
> Then run fio from another host, which brings the network into the picture.
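>
> A rough sketch of such a run against an RBD image using fio’s rbd engine
> (pool and image names are placeholders and must exist beforehand; run it
> from a client host so the network sits in the I/O path):
>
>   rbd create bench-test/bench-img --size 10G
>   fio --name=rbd-4k-qd1 --ioengine=rbd --clientname=admin \
>       --pool=bench-test --rbdname=bench-img \
>       --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
>       --runtime=60 --time_based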
>
> If those look fine, I would try something extreme with Ceph, such as a huge
> number of PGs or a failure domain of OSD, and deploy just a handful of OSDs
> to see if you can bring the problem out into the open. I would use a
> default setup, with no tweaks to the scheduler etc. Hopefully you’ll get
> some error messages in the logs (Ceph logs, syslog, dmesg). Maybe at that
> point it will become more obvious, or at least some messages will come
> through that make sense (to you or someone else on the list).
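>
> The usual places to look, sketched for a non-containerized deployment
> (the OSD id is a placeholder; cephadm setups use different unit names):
>
>   dmesg -T | grep -iE 'error|reset|timeout|fail'
>   journalctl -u ceph-osd@12 --since '1 hour ago'
>   ceph health detail
>   ceph crash ls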
>
> In other words, it seems you have to break this a bit more to get proper
> diagnostics. I know you guys have worked with Ceph before and can do the
> math on what the IOPS values should be; three clusters all seeing the same
> problem would most likely indicate a non-default configuration value that
> is not correct.
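>
> A quick way to surface such values, as a sketch (the OSD id is a
> placeholder, and the daemon command has to run on the OSD’s host):
>
>   ceph daemon osd.12 config diff   # settings differing from built-in defaults
>   ceph config dump                 # centrally stored overrides for the whole cluster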
> --
> Alex Gorbachev
> ISS
>
>
>
> On Mon, Nov 25, 2024 at 9:34 PM Martin Gerhard Loschwitz <
> martin.loschwitz@xxxxxxxxxxxxx> wrote:
>
>> Folks,
>>
>> I am getting somewhat desperate debugging multiple setups here within the
>> same environment. Three clusters, two SSD-only, one HDD-only, and what they
>> all have in common is abysmal 4k IOPS performance when measuring with
>> „rados bench“. Abysmal means: in an all-SSD cluster I get roughly 400 IOPS
>> across more than 250 devices. I know SAS SSDs are not ideal, but 400 IOPS
>> from that many devices looks a bit on the low side to me.
>>
>> In the second cluster, also all-SSD, I get roughly 120 4k IOPS, and the
>> HDD-only cluster delivers 60 4k IOPS. The latter two admittedly have
>> substantially fewer devices, but even with 20 HDDs, 68 4k IOPS seems like a
>> very bad value to me.
>>
>> I’ve tried to rule out everything I know of: BIOS misconfiguration, HBA
>> problems, networking trouble (I am seeing comparably bad values with a
>> size=1 pool), and so on and so forth. But to no avail. Has anybody dealt
>> with something similar, on Dell hardware or in general? What could cause
>> such extremely bad benchmark results?
>>
>> I measure with rados bench at qd=1 and 4k block size. „ceph tell osd
>> bench“ with 4k blocks yields 30k+ IOPS for every single device in the big
>> cluster, yet all that leads to is 400 IOPS in total when writing to it,
>> even with no replication in place. That looks a bit off, doesn’t it? Any
>> help will be greatly appreciated; even a pointer in the right direction
>> would be held in high esteem right now. Thank you very much in advance!
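>>
>> For reference, the kind of invocation I mean, with placeholder pool name
>> and OSD id (12288000 / 4096 = 3000 single writes for the OSD bench):
>>
>>   rados bench -p bench-test 60 write -b 4096 -t 1 --no-cleanup
>>   ceph tell osd.0 bench 12288000 4096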
>>
>> Best regards
>> Martin
>>
>
> --
>
> Martin Gerhard Loschwitz
> Geschäftsführer / CEO, True West IT Services GmbH
> P +49 2433 5253130
> M +49 176 61832178
> A Schmiedegasse 24a, 41836 Hückelhoven, Germany
> R HRB 21985, Amtsgericht Mönchengladbach
> True West IT Services GmbH is compliant with the GDPR regulation on data
> protection and privacy in the European Union and the European Economic
> Area. You can request the information on how we collect and process your
> private data according to the law by contacting the email sender.
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



