Martin, are the MONs set up on the same hosts, or is there latency to them
by any chance?
--
Alex Gorbachev
https://alextelescope.blogspot.com

On Tue, Nov 26, 2024 at 5:20 AM Martin Gerhard Loschwitz
<martin.loschwitz@xxxxxxxxxxxxx> wrote:

> Hi Alex,
>
> thank you for the reply. Here are all the steps we’ve done in the last
> weeks to reduce complexity (we’re focusing on the HDD cluster for now,
> where we are seeing the worst results relatively speaking, but which also
> happens to be the easiest setup network-wise, despite only having a 1G
> link between the nodes):
>
> * measure IOPS per physical device (the results were within expectations
>   for HDDs)
> * reinstall the OS, reset the BIOS, reset the HBA configuration (or
>   rather, switch the Dell PERC to HBA mode)
>
> The current setup is Ubuntu 24.04 with Linux 6.5. This yields better
> results than 20.04 with a 5.x kernel and Ceph 17 (65 vs. 41 IOPS), but
> all of that is still terrible.
>
> We’re also not seeing anything obvious in iostat. Latency is normal LAN
> latency with no packet loss, and MTU 1500 vs. MTU 9000 literally makes no
> difference.
>
> When we disable replication in that setup (pool size=1), we get about 90
> IOPS from the same pool, and there is no special network configuration in
> place. I am attaching a dump of historic OSD ops from an example OSD in
> the cluster for further reference; maybe somebody sees something obvious
> in there.
>
> Best regards
> Martin
>
>
> On 26.11.2024 at 03:43, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Martin,
>
> This is a bit of a generic recommendation, but I would go down the path
> of reducing complexity, i.e. first test the drive locally on the OSD node
> and see if there's anything going on with e.g. drive firmware, cables,
> HBA, or power.
>
> Then run fio from another host, which would incorporate the network.
>
> If those look fine, I would do something crazy with Ceph, such as a huge
> number of PGs, or a failure domain of OSD, and just deploy a handful of
> OSDs to see if you can bring the problem out into the open. I would use a
> default setup, with no tweaks to the scheduler etc. Hopefully you'll get
> some error messages in the logs (Ceph logs, syslog, dmesg). Maybe at that
> point it will become more obvious, or at least some messages will come
> through that make sense (to you or someone else on the list).
>
> In other words, it seems you have to break this a bit more to get proper
> diagnostics. I know you guys have played with Ceph before and can do the
> math on what the IOPS values should be; three clusters all seeing the
> same problem would most likely indicate a non-default configuration value
> that is not correct.
> --
> Alex Gorbachev
> ISS
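As a minimal sketch of the local baseline test Alex suggests (the device
path, job name and runtime below are placeholders, and the run writes to
the raw device, so it must only target a drive that holds no data):

  # 4k random writes at queue depth 1; O_DIRECT + O_SYNC, so every write
  # has to reach stable storage before the next one is issued
  fio --name=4k-qd1 --filename=/dev/sdX --ioengine=libaio \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
      --direct=1 --sync=1 --runtime=60 --time_based

The per-drive numbers from such a run are the ceiling any qd=1 result
measured through the cluster can later be compared against.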
>
> On Mon, Nov 25, 2024 at 9:34 PM Martin Gerhard Loschwitz
> <martin.loschwitz@xxxxxxxxxxxxx> wrote:
>
>> Folks,
>>
>> I am getting somewhat desperate debugging multiple setups here within
>> the same environment: three clusters, two SSD-only and one HDD-only, and
>> what they all have in common is abysmal 4k IOPS performance when
>> measured with „rados bench“. Abysmal means: in an all-SSD cluster I get
>> roughly 400 IOPS across more than 250 devices. I know SAS SSDs are not
>> ideal, but 400 IOPS across 250 devices looks a bit on the low side of
>> things to me.
>>
>> In the second cluster, also all-SSD, I get roughly 120 4k IOPS, and the
>> HDD-only cluster delivers 60 4k IOPS. Both of those admittedly have
>> substantially fewer devices, but even with 20 HDDs, 68 4k IOPS seems
>> like a very bad value to me.
>>
>> I’ve tried to rule out everything I know of: BIOS misconfigurations, HBA
>> problems, networking trouble (I am seeing comparably bad values with a
>> size=1 pool), and so on, but to no avail. Has anybody dealt with
>> something similar on Dell hardware, or in general? What could cause such
>> extremely bad benchmark results?
>>
>> I measure with rados bench and qd=1 at a 4k block size. „ceph tell osd
>> bench“ with 4k blocks yields 30k+ IOPS for every single device in the
>> big cluster, and all that leads to is 400 IOPS in total when writing to
>> the pool? Even with no replication in place? That looks a bit off,
>> doesn't it? Any help will be greatly appreciated; even a pointer in the
>> right direction would be held in high esteem right now. Thank you very
>> much in advance!
>>
>> Best regards
>> Martin
>
> --
> Martin Gerhard Loschwitz
> Managing Director / CEO, True West IT Services GmbH
> P +49 2433 5253130
> M +49 176 61832178
> A Schmiedegasse 24a, 41836 Hückelhoven, Germany
> R HRB 21985, Amtsgericht Mönchengladbach

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
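For anyone reproducing the numbers in this thread, the measurement Martin
describes maps roughly onto commands like the following (the pool name
"bench" and osd.0 are stand-ins; the rados run leaves benchmark objects in
the pool until the cleanup step):

  # per-OSD write bench: 12 MB total in 4k writes, which stays within the
  # default size limit for small blocks; exercises the OSD's data path
  # without client round trips
  ceph tell osd.0 bench 12288000 4096

  # pool-level 4k writes with a single op in flight (queue depth 1)
  rados bench -p bench 60 write -b 4096 -t 1 --no-cleanup

  # remove the benchmark objects afterwards
  rados -p bench cleanup

With one op in flight, every write in the rados case pays the full
client-to-OSD round trip plus replication, so the per-OSD figure is only an
upper bound rather than a number the pool-level result should match.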