Perhaps run "iostat -xtcy <list of OSD devices> 5" on the OSD hosts to see
whether any of the drives show unusually high utilization despite low
IOPS/request counts?
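Something along these lines should make such a drive stand out (an untested
sketch; the device list is only a placeholder and has to be adapted to the
actual OSD data devices on each host):

  # Placeholder list -- replace with the real OSD data devices on this host.
  OSD_DEVS="sdb sdc sdd sde"
  # -x extended device stats, -t timestamps, -c CPU report, -y skip the
  # since-boot summary; report every 5 seconds.
  iostat -xtcy $OSD_DEVS 5

A device that sits near 100 in the %util column while r/s and w/s stay low,
or whose await times climb into the hundreds of milliseconds, would be a good
candidate for a dying disk or a misbehaving controller.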

On Tue, 6 Dec 2022 at 10:02, Boris Behrens <bb@xxxxxxxxx> wrote:
>
> Hi Sven,
> I am searching really hard for defective hardware, but I am currently out
> of ideas:
> - checked the Prometheus stats, but in all that data I don't know what to
>   look for (the OSD apply latency is very low at the mentioned point in
>   time and went up to 40ms after all OSDs were restarted)
> - smartctl shows nothing
> - dmesg shows nothing
> - network data shows nothing
> - OSD and cluster logs show nothing
>
> If anybody has a good tip on what I can check, that would be awesome: a
> string to search for in the logs (I made a copy of that day's logs), or a
> tool to run against the hardware. I am 100% out of ideas as to what it
> could be.
> Within a time frame of 20s, 2/3 of our OSDs went from "all fine" to "I am
> waiting for the replicas to do their work" (log message 'waiting for sub
> ops'), but there was no alert that any OSD had connection problems to
> other OSDs. Additionally, the cluster_network uses the same interface,
> switch, and everything else as the public_network; the only difference is
> the VLAN ID (I plan to remove the cluster_network because it does not
> provide any benefit for us).
>
> I am also planning to upgrade all hosts from CentOS 7 to Ubuntu 20.04
> (newer kernel, standardized OS config, and so on).
>
> On Mon, 5 Dec 2022 at 14:24, Sven Kieske <S.Kieske@xxxxxxxxxxx> wrote:
>
> > On Sat, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > > Hi,
> > > maybe someone here can help me debug an issue we faced today.
> > >
> > > Today one of our clusters came to a grinding halt, with 2/3 of our
> > > OSDs reporting slow ops.
> > > The only option to get it back to work quickly was to restart all OSD
> > > daemons.
> > >
> > > The cluster is an Octopus cluster with 150 enterprise SSD OSDs. The
> > > last work on the cluster: a node was synced in 4 days ago.
> > >
> > > The only health issue that was reported was SLOW_OPS. No slow pings
> > > on the networks. No restarting OSDs. Nothing.
> > >
> > > I was able to pin it down to a 20s time frame, and I read ALL the
> > > logs in a 20-minute window around the issue.
> > >
> > > I haven't found any clues.
> > >
> > > Maybe someone has encountered this in the past?
> >
> > Do you happen to run your RocksDB on a dedicated caching device (NVMe
> > SSD)?
> >
> > I observed slow ops in Octopus after a faulty NVMe SSD was inserted in
> > one Ceph server. As was said in other mails, try to isolate your root
> > cause.
> >
> > Maybe the node added 4 days ago was the culprit here?
> >
> > We were able to pinpoint the NVMe by monitoring the slow OSDs, and the
> > commonality in this case was the same NVMe cache device.
> >
> > You should always benchmark new hardware / perform burn-in tests IMHO,
> > which is not always possible due to environment constraints.
> >
> > --
> > Mit freundlichen Grüßen / Regards
> >
> > Sven Kieske
> > Systementwickler / systems engineer
> >
> >
> > Mittwald CM Service GmbH & Co. KG
> > Königsberger Straße 4-6
> > 32339 Espelkamp
> >
> > Tel.: 05772 / 293-900
> > Fax: 05772 / 293-333
> >
> > https://www.mittwald.de
> >
> > Managing directors: Robert Meyer, Florian Jürgens
> >
> > St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad
> > Oeynhausen
> > General partner: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad
> > Oeynhausen
> >
> > Information on data processing in the course of our business activities
> > pursuant to Art. 13-14 GDPR is available at www.mittwald.de/ds.
>
>
> --
> The "UTF-8 problems" self-help group will, as an exception, meet in the
> large hall this time.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx