Re: 4k IOPS: miserable performance in All-SSD cluster

Martin Gerhard Loschwitz <martin.loschwitz@xxxxxxxxxxxxx> · Tue, 26 Nov 2024 11:22:53 +0100

Hi Anthony,

I think the problems have always been there, although these setups are already a bit older. We’ve explicitly set the MTU to 9000 on both switches and all affected machines, but MTU 1500 versus MTU 9000 makes no measurable difference.

The network is non-LACP on one of the test clusters (the HDD cluster with the worst hardware). It’s a single 1G link, but that should not be a problem for an idle cluster during a normal 4k IOPS test, should it?
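
To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes a single operation in flight (qd=1), so per-operation latency is simply 1 / IOPS, and it compares the 4k payload against the raw 1 Gbit/s line rate, ignoring protocol overhead. The function names are only for illustration, and only the HDD cluster is actually on a 1G link; the other two rows are shown for comparison.

```python
# Back-of-the-envelope numbers for a queue-depth-1, 4 KiB benchmark.
BLOCK_SIZE = 4096            # bytes per operation (4k)
LINK_BYTES_PER_S = 1e9 / 8   # raw 1 Gbit/s line rate, protocol overhead ignored


def per_op_latency_ms(iops: float) -> float:
    """With one op in flight, average latency per op is 1 / IOPS."""
    return 1000.0 / iops


def link_utilisation(iops: float) -> float:
    """Fraction of the raw 1 Gbit/s bandwidth used by the 4k payload alone."""
    return iops * BLOCK_SIZE / LINK_BYTES_PER_S


# IOPS figures quoted in this thread
measured = {
    "big all-SSD cluster": 400,
    "second all-SSD cluster": 120,
    "HDD cluster": 60,
}

for name, iops in measured.items():
    print(f"{name}: {per_op_latency_ms(iops):.2f} ms per op, "
          f"{link_utilisation(iops):.2%} of a 1 Gbit/s link")
```

Even at 400 IOPS the payload is a tiny fraction of what a 1G link can carry, so bandwidth is unlikely to be the limit. What the numbers do say is that each 4k write takes several milliseconds end to end (2.5 ms at 400 IOPS), which points at latency somewhere in the path rather than throughput.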

Best regards
Martin

> On 26.11.2024 at 04:48, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> 
> Good insights from Alex.  
> 
> Are these clusters all new? Or have they been around a while, previously happier?
> 
> One idea that comes to mind is an MTU mismatch between hosts and switches, or some manner of bonding misalignment.  What does `netstat -I` show?  `ethtool -S`?  I’m thinking that maybe just maybe bonding (if present) is awry in some fashion such that half of packets in/out disappear into the twilight zone. Like if LACP appears up on the host but a switch issue dooms all packets on one link, in or out.  
> 
>> On Nov 25, 2024, at 9:45 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> 
>> Hi Martin,
>> 
>> This is a bit of generic recommendation, but I would go down the path of
>> reducing complexity, i.e. first test the drive locally on the OSD node and
>> see if there's anything going on with e.g. drive firmware, cables, HBA,
>> power.
>> 
>> Then do fio from another host, and this would incorporate networking.
>> 
>> If those look fine, I would do something crazy with Ceph, such as a huge
>> number of PGs, or failure domain of OSD, and just deploy a handful of OSDs
>> to see if you can bring the problem out in the open.  I would use a default
>> setup, with no tweaks to scheduler etc.  Hopefully, you'll get some error
>> messages in the logs - ceph logs, syslog, dmesg.  Maybe at that point it
>> will become more obvious, or at least some messages will come through that
>> will make sense (to you or someone else on the list).
>> 
>> In other words, it seems you have to break this a bit more to get proper
>> diagnostics.  I know you guys have played with Ceph before, and can do the
>> math of what the IOPS values should be - three clusters all seeing the same
>> problem would most likely indicate a non-default configuration value that
>> is not correct.
>> --
>> Alex Gorbachev
>> ISS
>> 
>> 
>> 
>>> On Mon, Nov 25, 2024 at 9:34 PM Martin Gerhard Loschwitz <martin.loschwitz@xxxxxxxxxxxxx> wrote:
>>> 
>>> Folks,
>>> 
>>> I am getting somewhat desperate debugging multiple setups here within the
>>> same environment. Three clusters, two SSD-only, one HDD-only, and what they
>>> all have in common is abysmal 4k IOPS performance when measuring with
>>> „rados bench“. Abysmal means: In an All-SSD cluster I will get roughly 400
>>> IOPS over more than 250 devices. I know SAS-SSDs are not ideal, but 250
>>> looks a bit on the low side of things to me.
>>> 
>>> In the second cluster, also All-SSD based, I get roughly 120 4k IOPS. And
>>> the HDD-only cluster delivers 60 4k IOPS. The latter both with
>>> substantially fewer devices, granted. But even with 20 HDDs, 68 4k IOPS
>>> seems like a very bad value to me.
>>> 
>>> I’ve tried to rule out everything I know of: BIOS misconfigurations, HBA
>>> problems, networking trouble (I am seeing comparably bad values with a
>>> size=1 pool) and so on and so forth. But to no avail. Has anybody dealt
>>> with something similar on Dell hardware or in general? What could cause
>>> such extremely bad benchmark results?
>>> 
>>> I measure with rados bench and qd=1 at 4k block size. „ceph tell osd
>>> bench“ with 4k blocks yields 30k+ IOPS for every single device in the big
>>> cluster, and all that leads to is 400 IOPS in total when writing to it?
>>> Even with no replication in place? That looks a bit off, doesn't it? Any
>>> help will be greatly appreciated. Even a
>>> pointer in the right direction would be held in high esteem right now.
>>> Thank you very much in advance!
>>> 
>>> Best regards
>>> Martin
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>> 

-- 

Martin Gerhard Loschwitz
Geschäftsführer / CEO, True West IT Services GmbH
P +49 2433 5253130
M +49 176 61832178 <https://mysig.io/4ngY23j0>
A Schmiedegasse 24a, 41836 Hückelhoven, Deutschland
R HRB 21985, Amtsgericht Mönchengladbach <https://mysig.io/b4g0y3rz>
