Good insights from Alex. Are these clusters all new, or have they been
around a while and previously happier? One idea that comes to mind is an
MTU mismatch between hosts and switches, or some manner of bonding
misalignment. What does `netstat -I` show? `ethtool -S`?

I'm thinking that maybe, just maybe, bonding (if present) is awry in some
fashion such that half of the packets in and out disappear into the
twilight zone. Like if LACP appears up on the host, but a switch issue
dooms all packets on one link, in or out.
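To make that concrete, here's roughly where I'd look first. Just a sketch;
`eth0`, `bond0`, and the peer hostname are placeholders for whatever your
hosts actually use:

    # MTU per interface (compare against the switch port config)
    ip -d link show

    # per-interface RX/TX error and drop counters
    ip -s link show eth0
    ethtool -S eth0 | grep -iE 'err|drop|disc'

    # bonding state as the kernel sees it, incl. per-slave LACP status
    cat /proc/net/bonding/bond0

    # path MTU probe with DF set (8972 = 9000 minus 28 bytes of IP/ICMP
    # headers if you run jumbo frames; use 1472 for a standard 1500 MTU)
    ping -M do -s 8972 <peer-osd-host>

If one bond slave is eating traffic, the drop counters and the per-slave
sections of /proc/net/bonding/bond0 will usually give it away.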
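And seconding Alex's advice below: when you fio a drive locally, a qd=1 /
4k sync write test is the closest analogue to what rados bench at qd=1
asks of the drive. A minimal sketch, assuming /dev/sdX is a disk you can
safely overwrite (this writes to the raw device!):

    fio --name=4k-qd1 --filename=/dev/sdX --ioengine=libaio \
        --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=1 \
        --numjobs=1 --runtime=60 --time_based --group_reporting

The --sync=1 bit matters: Ceph's write path flushes constantly, and SSDs
without power-loss protection can drop from tens of thousands of cached
4k IOPS to a few hundred once every write has to hit stable media.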
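Martin, it might also help the list to see the exact invocations. I'm
assuming something along these lines ("testpool" is a placeholder):

    # client side: 4k writes, one op in flight, 60 seconds
    rados bench -p testpool 60 write -b 4096 -t 1 --no-cleanup

    # per OSD: 12 MB of 4k writes (within the default osd bench cap),
    # bypassing the network entirely
    ceph tell osd.0 bench 12288000 4096

With size=1 and one op in flight, 400 IOPS works out to roughly 2.5 ms
per write end to end. Comparing that against the average latency rados
bench reports, the per-OSD bench result, and a plain ping RTT between
client and OSD host should show where the time actually goes.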
> On Nov 25, 2024, at 9:45 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Martin,
>
> This is a bit of a generic recommendation, but I would go down the path
> of reducing complexity, i.e. first test the drive locally on the OSD
> node and see if there's anything going on with e.g. drive firmware,
> cables, HBA, power.
>
> Then do fio from another host, which would incorporate networking.
>
> If those look fine, I would do something crazy with Ceph, such as a huge
> number of PGs, or a failure domain of OSD, and just deploy a handful of
> OSDs to see if you can bring the problem out into the open. I would use
> a default setup, with no tweaks to the scheduler etc. Hopefully, you'll
> get some error messages in the logs - Ceph logs, syslog, dmesg. Maybe at
> that point it will become more obvious, or at least some messages will
> come through that make sense (to you or someone else on the list).
>
> In other words, it seems you have to break this a bit more to get proper
> diagnostics. I know you guys have played with Ceph before and can do the
> math of what the IOPS values should be - three clusters all seeing the
> same problem would most likely indicate a non-default configuration
> value that is not correct.
> --
> Alex Gorbachev
> ISS
>
>> On Mon, Nov 25, 2024 at 9:34 PM Martin Gerhard Loschwitz <
>> martin.loschwitz@xxxxxxxxxxxxx> wrote:
>>
>> Folks,
>>
>> I am getting somewhat desperate debugging multiple setups here within
>> the same environment. Three clusters, two SSD-only, one HDD-only, and
>> what they all have in common is abysmal 4k IOPS performance when
>> measuring with "rados bench". Abysmal means: in an all-SSD cluster I
>> get roughly 400 IOPS over more than 250 devices. I know SAS SSDs are
>> not ideal, but 400 IOPS across 250 devices looks a bit on the low side
>> of things to me.
>>
>> In the second cluster, also all-SSD, I get roughly 120 4k IOPS. And the
>> HDD-only cluster delivers 60 4k IOPS. The latter two with substantially
>> fewer devices, granted. But even with just 20 HDDs, that seems like a
>> very bad value to me.
>>
>> I've tried to rule out everything I know of: BIOS misconfigurations,
>> HBA problems, networking trouble (I am seeing comparably bad values
>> with a size=1 pool) and so on and so forth. But to no avail. Has
>> anybody dealt with something similar on Dell hardware, or in general?
>> What could cause such extremely bad benchmark results?
>>
>> I measure with rados bench and qd=1 at 4k block size. "ceph tell osd
>> bench" with 4k blocks yields 30k+ IOPS for every single device in the
>> big cluster, and all that leads to is 400 IOPS in total when writing to
>> it? Even with no replication in place? That looks a bit off, doesn't
>> it? Any help will be greatly appreciated. Even a pointer in the right
>> direction would be held in high esteem right now. Thank you very much
>> in advance!
>>
>> Best regards
>> Martin