On Sat, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> hi,
> maybe someone here can help me to debug an issue we faced today.
>
> Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> reporting slow ops.
> The only option to get it back to work fast was to restart all OSD daemons.
>
> The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work
> on the cluster: synced in a node 4 days ago.
>
> The only health issue that was reported was the SLOW_OPS. No slow pings
> on the networks. No restarting OSDs. Nothing.
>
> I was able to pin it down to a 20s timeframe and I read ALL the logs in a 20
> minute timeframe around this issue.
>
> I haven't found any clues.
>
> Maybe someone encountered this in the past?

Do you happen to run your RocksDB on a dedicated caching device (NVMe SSD)?

I observed slow ops in Octopus after a faulty NVMe SSD was inserted in one
Ceph server.

As was said in other mails, try to isolate your root cause. Maybe the node
added 4 days ago was the culprit here?

We were able to pinpoint the NVMe by monitoring the slow OSDs; what they had
in common in this case was the same NVMe cache device.

You should always benchmark new hardware and perform burn-in tests, imho,
although that is not always possible due to environment constraints.

--
Kind regards
Sven Kieske
Systems engineer

Mittwald CM Service GmbH & Co. KG
https://www.mittwald.de
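
The commonality check described in the reply (slow OSDs that share one NVMe cache device) could be sketched roughly as follows. This is only an illustration under assumptions: the function name, the OSD ids, and the device names are hypothetical, and the two inputs would have to be collected separately, for example the slow OSDs from `ceph health detail` and the OSD-to-device mapping from `ceph osd metadata` or your own inventory.

```python
from collections import Counter


def rank_db_devices(slow_osds, osd_to_db_device):
    """Rank DB devices by the share of their OSDs that report slow ops.

    Returns (device, slow_count, total_count) tuples, most suspicious first.
    """
    # How many OSDs are backed by each DB device in total.
    total = Counter(osd_to_db_device.values())
    # How many of the slow OSDs are backed by each DB device.
    slow = Counter(
        osd_to_db_device[osd]
        for osd in set(slow_osds)
        if osd in osd_to_db_device
    )
    rows = [(dev, slow[dev], total[dev]) for dev in total]
    # Sort by slow fraction first, then by absolute slow count.
    rows.sort(key=lambda r: (r[1] / r[2], r[1]), reverse=True)
    return rows


# Hypothetical inventory: OSD id -> "host:device" of its RocksDB device.
osd_to_db_device = {
    0: "node1:/dev/nvme0n1",
    1: "node1:/dev/nvme0n1",
    2: "node1:/dev/nvme1n1",
    3: "node1:/dev/nvme1n1",
    4: "node2:/dev/nvme0n1",
    5: "node2:/dev/nvme0n1",
}
slow_osds = [0, 1, 4]  # hypothetical OSDs currently reporting slow ops

for dev, n_slow, n_total in rank_db_devices(slow_osds, osd_to_db_device):
    print(f"{dev}: {n_slow}/{n_total} OSDs slow")
```

A device whose OSDs are all slow while its neighbours are healthy is a candidate for closer inspection. If the slow OSDs are spread evenly across devices and hosts, a single cache device is a less likely root cause.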