Re: rocksdb corruption with 16.2.6

On Mo, 2021-09-20 at 10:29 -0500, Mark Nelson wrote:
> At least in one case for us, the user was using consumer grade SSDs 
> without power loss protection.  I don't think we ever fully diagnosed if 
> that was the cause though.  Another case potentially was related to high 
> memory usage on the node.  Hardware errors are a legitimate concern here 
> so probably checking dmesg/smartctl/etc is warranted.  ECC memory 
> obviously helps too (or rather the lack of which makes it more difficult 
> to diagnose).
> 
> 
> For folks that have experienced this, any info you can give related to 
> the HW involved would be helpful.  We (and other projects) have seen 
> similar things over the years but this is a notoriously difficult issue 
> to track down given that it could be any one of many different things 
> and it may or may not be our code.
> 

Hi,

Maybe I can help debug this, and you can help me too!

We run 14.2.10 in pre-production, and I'm fairly confident we hit this bug:

https://tracker.ceph.com/issues/37282

This is an Ubuntu-based ceph-ansible deployment using enterprise SSDs with power-loss
protection.

We see rare, random OSD crashes (reported in ceph crash ls) distributed across our
140-OSD erasure-coded cluster.
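
In case the details help, the crash reports come from the mgr crash module; a
minimal way to pull them up (the crash ID below is a placeholder):

    # list all crash reports collected by the crash module
    ceph crash ls

    # dump the full metadata and backtrace for a single crash
    ceph crash info <crash-id>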

This is an all-flash SSD cluster with metadata on NVMe storage.

As I said, these are enterprise SSDs from Intel (S4610) and Samsung (MZWLL1T6HAJQ).

I have already run a deep BlueStore fsck and a repair. I see no hardware errors at all,
not even minor issues in SMART etc.
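
For completeness, this is roughly what we ran; the OSD ID and device path below
are placeholders, and the OSD has to be stopped while ceph-bluestore-tool
touches its store:

    # stop the OSD first (osd.2 is a placeholder)
    systemctl stop ceph-osd@2

    # deep fsck also reads back object data and verifies checksums
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-2 --deep

    # repair anything fsck complained about, then restart
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-2
    systemctl start ceph-osd@2

    # SMART health summary and full attribute dump per drive
    smartctl -H /dev/sdX
    smartctl -a /dev/sdX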

This started happening some time after we upgraded the cluster from 14.2.6 to 14.2.10, FWIW.

We also pushed osd_memory_target up rather aggressively after that, so I feared the
crashes might be OSDs dying from OOM (see e.g. this for a
report: https://www.mail-archive.com/search?l=ceph-users%40lists.ceph.com&q=subject:%22%5C%5Bceph%5C-users%5C%5D+OSD+crash+after+change+of+osd_memory_target%22&o=newest&f=1 ).
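
To check whether the kernel OOM killer was actually involved, something like
this on each OSD node (the --since date is a placeholder):

    # search the kernel log for OOM kills
    journalctl -k --since "2021-08-01" | grep -i -e "out of memory" -e "killed process"

    # or, if the in-memory kernel log has not rotated away yet:
    dmesg -T | grep -i -e "out of memory" -e "killed process"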

I'm currently in the process of lowering osd_memory_target again.
We have had no crashes since the beginning of September.
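
For reference, a minimal sketch of how the target can be lowered cluster-wide
(4 GiB here is just the upstream default; ceph-ansible deployments may manage
the option via ceph.conf instead):

    # value is in bytes; 4294967296 = 4 GiB, the default
    ceph config set osd osd_memory_target 4294967296

    # confirm what a given daemon is actually running with (osd.0 is a placeholder)
    ceph config show osd.0 | grep osd_memory_target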

If you need more information about the past crashes, I can provide logs etc.

-- 
Mit freundlichen Grüßen / Regards

Sven Kieske
Systementwickler / systems engineer
 
 
Mittwald CM Service GmbH & Co. KG
Königsberger Straße 4-6
32339 Espelkamp
 
Tel.: 05772 / 293-900
Fax: 05772 / 293-333
 
https://www.mittwald.de
 
Managing directors: Robert Meyer, Florian Jürgens
 
Tax no.: 331/5721/1033, VAT ID: DE814773217, HRA 6640, district court of Bad Oeynhausen
General partner: Robert Meyer Verwaltungs GmbH, HRB 13260, district court of Bad Oeynhausen

Information on data processing in the course of our business activities
pursuant to Art. 13-14 GDPR is available at www.mittwald.de/ds.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
