On Mon, 2021-09-20 at 10:29 -0500, Mark Nelson wrote:
> At least in one case for us, the user was using consumer grade SSDs
> without power loss protection. I don't think we ever fully diagnosed if
> that was the cause though. Another case potentially was related to high
> memory usage on the node. Hardware errors are a legitimate concern here
> so probably checking dmesg/smartctl/etc is warranted. ECC memory
> obviously helps too (or rather the lack of which makes it more difficult
> to diagnose).
>
> For folks that have experienced this, any info you can give related to
> the HW involved would be helpful. We (and other projects) have seen
> similar things over the years but this is a notoriously difficult issue
> to track down given that it could be any one of many different things
> and it may or may not be our code.
>

Hi,

maybe I can help debug this and you can help me too!

We run 14.2.10 in pre-production and I'm fairly confident we hit this bug:

https://tracker.ceph.com/issues/37282

This is an Ubuntu-based ceph-ansible deployment using enterprise SSDs with
power loss protection.

We see rare, random OSD crashes (in "ceph crash ls") distributed across our
140-OSD erasure-coded cluster. This is an all-flash SSD cluster with metadata
on NVMe SSD storage. As I said, these are enterprise SSDs from Intel (S4610)
and Samsung (MZWLL1T6HAJQ).

I have already run a deep bluestore fsck and a repair. I see no hardware
errors at all, not even small issues in SMART etc.

The crashes started some time after we upgraded the cluster from 14.2.6 to
14.2.10, fwiw.

After that upgrade we also pushed the osd_memory_target up somewhat
aggressively, so I feared the crashes might be caused by OSDs dying due to
OOM (see e.g. this for a report:
https://www.mail-archive.com/search?l=ceph-users%40lists.ceph.com&q=subject:%22%5C%5Bceph%5C-users%5C%5D+OSD+crash+after+change+of+osd_memory_target%22&o=newest&f=1
).

I'm currently in the process of lowering the osd_memory_target again.
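
As a rough sanity check on the OOM concern, one can compare the summed osd_memory_target of all OSDs on a node against that node's RAM. The sketch below is only an illustration: the function name, the 20% overshoot allowance and all numbers in the usage lines are assumptions and are not taken from the cluster described here. It relies only on the fact that osd_memory_target is a best-effort target, not a hard limit, so an OSD can temporarily use more.

```python
GIB = 1024 ** 3


def osd_memory_headroom(node_ram_bytes, osds_per_node,
                        osd_memory_target_bytes, overshoot_percent=20):
    """Return the RAM (in bytes) left on a node if every OSD sits at its
    memory target plus an assumed overshoot. A negative result means the
    targets alone can already exceed the node's RAM.

    overshoot_percent is a hypothetical safety allowance, not a Ceph setting.
    """
    worst_case = (osds_per_node * osd_memory_target_bytes
                  * (100 + overshoot_percent)) // 100
    return node_ram_bytes - worst_case


# Hypothetical node: 128 GiB RAM, 10 OSDs.
print(osd_memory_headroom(128 * GIB, 10, 8 * GIB) // GIB)   # prints 32
print(osd_memory_headroom(128 * GIB, 10, 12 * GIB) // GIB)  # prints -16
```

The headroom also has to cover the kernel, page cache and any other daemons on the node, so a small positive value is not necessarily safe.
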
We have had no crashes since the beginning of September.

If you need more information about the past crashes, I can provide logs etc.

-- 
Regards

Sven Kieske
Systems engineer

Mittwald CM Service GmbH & Co. KG
https://www.mittwald.de