Hi Sven,

I am searching really hard for defective hardware, but I am currently out of ideas:

- checked the Prometheus stats, but in all that data I don't know what to look for (OSD apply latency is very low at the mentioned point in time and went up to 40ms after all OSDs were restarted)
- smartctl shows nothing
- dmesg shows nothing
- network data shows nothing
- OSD and cluster logs show nothing

If anybody has a good tip on what I can check, that would be awesome: a string to grep for in the logs (I made a copy of that day's logs), or a tool to fire against the hardware. I am 100% out of ideas about what it could be.

Within a time frame of 20s, 2/3 of our OSDs went from "all fine" to "I am waiting for the replicas to do their work" (log message 'waiting for sub ops'), but there was no alert that any OSD had connection problems to other OSDs. Additionally, the cluster_network runs over the same interface and switch as the public_network; the only difference is the VLAN ID (I plan to remove the cluster_network because it does not gain us anything). I am also planning to upgrade all hosts from CentOS 7 to Ubuntu 20.04 (newer kernel, standardized OS config and so on).

On Mon, 5 Dec 2022 at 14:24, Sven Kieske <S.Kieske@xxxxxxxxxxx> wrote:

> On Sat, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > Hi,
> > maybe someone here can help me to debug an issue we faced today.
> >
> > Today one of our clusters came to a grinding halt, with 2/3 of our OSDs
> > reporting slow ops. The only option to get it back to work quickly was
> > to restart all OSD daemons.
> >
> > The cluster is an Octopus cluster with 150 enterprise SSD OSDs. The last
> > work on the cluster was syncing in a node 4 days ago.
> >
> > The only health issue that was reported was SLOW_OPS. No slow pings on
> > the networks. No restarting OSDs. Nothing.
> >
> > I was able to pin it down to a 20s timeframe, and I read ALL the logs in
> > a 20-minute window around this issue.
> >
> > I haven't found any clues.
> >
> > Maybe someone has encountered this in the past?
>
> Do you happen to run your RocksDB on a dedicated caching device (NVMe SSD)?
>
> I observed slow ops in Octopus after a faulty NVMe SSD was inserted in one
> Ceph server. As was said in other mails, try to isolate your root cause.
>
> Maybe the node added 4 days ago was the culprit here?
>
> We were able to pinpoint the NVMe by monitoring the slow OSDs, and the
> commonality in this case was the same NVMe cache device.
>
> You should always benchmark new hardware / perform burn-in tests IMHO,
> which is not always possible due to environment constraints.
>
> --
> Mit freundlichen Grüßen / Regards
>
> Sven Kieske
> Systementwickler / systems engineer
>
>
> Mittwald CM Service GmbH & Co. KG
> Königsberger Straße 4-6
> 32339 Espelkamp
>
> Tel.: 05772 / 293-900
> Fax: 05772 / 293-333
>
> https://www.mittwald.de
>
> Managing directors (Geschäftsführer): Robert Meyer, Florian Jürgens
>
> St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen
> General partner (Komplementärin): Robert Meyer Verwaltungs GmbH, HRB 13260,
> AG Bad Oeynhausen
>
> Information on data processing in the course of our business activities
> pursuant to Art. 13-14 GDPR is available at www.mittwald.de/ds.
>

--
This time, as an exception, the self-help group "UTF-8 problems" will meet in the big hall.
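
For the kind of correlation Sven describes, a minimal sketch could look like the following: it tallies the 'waiting for sub ops' hits per OSD in the copied logs and then looks up each affected OSD's host and DB device via 'ceph osd metadata', to see whether the slow OSDs share a common cache device. The log directory, the file name pattern, and the bluefs_db_devices field are assumptions about a fairly default Octopus deployment and may need adjusting to your environment.

    #!/usr/bin/env python3
    # Rough sketch: count 'waiting for sub ops' per OSD in the saved logs,
    # then map the affected OSDs to their hosts and DB devices.
    # Log path/pattern and the 'bluefs_db_devices' field are assumptions.
    import collections
    import glob
    import json
    import re
    import subprocess

    slow = collections.Counter()
    for path in glob.glob("/var/log/ceph/ceph-osd.*.log"):  # assumed log location
        match = re.search(r"ceph-osd\.(\d+)\.log$", path)
        if not match:
            continue
        osd_id = match.group(1)
        with open(path, errors="replace") as fh:
            slow[osd_id] += sum("waiting for sub ops" in line for line in fh)

    # 'ceph osd metadata' without an ID dumps metadata for all OSDs as JSON.
    meta = json.loads(subprocess.check_output(
        ["ceph", "osd", "metadata", "--format", "json"]))
    by_id = {str(m["id"]): m for m in meta}

    for osd_id, hits in slow.most_common():
        if hits == 0:
            continue
        m = by_id.get(osd_id, {})
        # bluefs_db_devices only shows up when RocksDB sits on a separate device
        print(osd_id, hits, m.get("hostname", "?"),
              m.get("bluefs_db_devices", m.get("devices", "?")))

If most of the slow OSDs end up pointing at the same device or the same host, that would fit the faulty-NVMe pattern Sven described.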