Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

Hey Sebastian,

thanks a lot for the update, please see more questions inline.


Thanks,

Igor

On 1/22/2022 2:13 AM, Sebastian Mazza wrote:
Hey Igor,

thank you for your response and your suggestions.

I've tried to simulate every imaginable load that the cluster might have handled before the three OSDs crashed.
I rebooted the servers many times while the cluster was under load. If more than a single node was rebooted at the same time, the clients hung until enough servers were up again. Which is perfectly fine!
I really tried hard to crash it, but I failed. Which is excellent in general, but unfortunately not helpful for finding the root cause of the problem with the corrupted RocksDBs.
And you haven't made any environment/config changes, e.g. disk caching disablement, since the last issue, right?
There is an environmental change, since I'm currently missing one of my two ethernet switches for the cluster. The switches (should) provide an MLAG for every server, so every server uses a Linux interface bond that is connected with one cable to each switch. However, one of the switches is currently away for RMA because it sporadically failed to (re)boot. I did not change anything in the network config of the servers, but of course the Linux bonding driver is currently not able to balance the network traffic across two links, since only one is active. Could this have an influence?
Apart from disconnecting half of the network cables I did not change anything. All the HDDs are the same and are inserted into the same drive bays.
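For reference, the current state of such a bond (active slaves, MII status, per-link failure counters) can be inspected on each node roughly like this; the interface name bond0 is only an assumption:

cat /proc/net/bonding/bond0    # bonding mode, MII status and per-slave link state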

Configuration-wise I'm not aware of any change. I only destroyed and recreated the 3 failed OSDs.

I have now checked the write cache settings of all HDDs with `hdparm -W /dev/sdX`, which always returns “write-caching =  1 (on)”.
I also checked the OSD setting “bluefs_buffered_io” with `ceph daemon osd.X config show | grep bluefs_buffered_io`, which returned true for all OSDs.
I'm pretty sure that all of these caches were always on.
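For reference, a quick way to repeat both checks across all drives and all local OSDs could look roughly like this (a sketch only; the /dev/sd{a..f} device names are placeholders that need to match the actual host):

# volatile write cache state of each drive (device names are examples)
for dev in /dev/sd{a..f}; do hdparm -W "$dev"; done

# bluefs_buffered_io of every OSD with an admin socket on this host
for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo -n "$sock: "; ceph daemon "$sock" config get bluefs_buffered_io
done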


Do you suggest disabling the HDD write caching and/or bluefs_buffered_io for production clusters?

Generally the upstream recommendation is to disable disk write caching; there have been multiple complaints that it might negatively impact performance in some setups.
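If you do decide to turn it off, a minimal sketch for SATA drives (SAS drives typically need sdparm instead, and the setting may not survive a reboot or power cycle, so it is usually re-applied via a udev rule or init script):

hdparm -W 0 /dev/sdX    # disable the drive's volatile write cache
hdparm -W /dev/sdX      # verify: should now report write-caching = 0 (off)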

As for bluefs_buffered_io - please keep it on, its disablement is known to cause a performance drop.


When rebooting a node - did you perform it with a regular OS command (reboot or poweroff) or with a power switch?
I never did a hard reset or used the power switch. I used `init 6` to perform a reboot. Each server has redundant power supplies, with one connected to a battery backup and the other to the grid. Therefore, I do think that none of the servers ever faced an unclean shutdown or reboot.

So the original reboot which caused the failures was made in the same manner, right?
Best regards,
Sebastian

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



