I'm replying to my own message as it appears we have "fixed" the issue.
In short, we restarted all OSD hosts and all of the presumed-lost data
reappeared. It's likely that some OSDs were stuck unreachable but were
somehow never flagged as down in the cluster.
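For anyone hitting something similar, the kind of checks and the restart we did can be sketched roughly as follows (the exact unit names depend on how the OSDs were deployed; this assumes a systemd-based install):

```shell
# Check overall cluster health and any OSD-level warnings
ceph health detail
ceph osd stat

# List the OSD tree; OSDs that are stuck should show as "down" here,
# though in our case they apparently never did
ceph osd tree

# On each OSD host, restart all OSD daemons (systemd deployments)
systemctl restart ceph-osd.target
```

After restarting the OSD daemons on every host, the data that had appeared lost became readable again.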
On 8/3/21 8:15 PM, J-P Methot wrote:
Hi,
We've encountered this issue on Ceph Pacific, with an Openstack
Wallaby cluster hooked to it. Essentially, we're slowly pushing this
setup into production so we're testing it and encountered this oddity.
My colleague wanted to do some network redundancy tests, so he
manually shut down an RBD-backed VM in OpenStack and then started
shutting down network switches. This didn't go well and caused
instability on the network, with potential packet loss. When he
fixed the problem, he started the VM back up and its filesystem was
corrupted and unrecoverable. There was no activity from Ceph clients
while the tests were going on. There are no errors in Ceph status, and no
missing PGs or objects are reported. As far as Ceph is concerned,
there is no issue, despite this RBD mysteriously getting
corrupted.
So to recap:
1. Clean shutdown of the VM in OpenStack.
2. Network tests cause downtime and packet loss.
3. VM starts but can't boot; black screen in the console.
4. Investigation shows that the XFS filesystem on the VM's sda1 is
unrecoverable by xfs_repair.
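For reference, the investigation in step 4 amounts to something like the following (device path is from inside the guest; run only the read-only check unless you accept data loss):

```shell
# Read-only check: report what xfs_repair would do without modifying anything
xfs_repair -n /dev/sda1

# Destructive last resort: zero the XFS log so repair can proceed.
# This discards any un-replayed log contents and can lose data.
xfs_repair -L /dev/sda1
```

In our case even these steps could not bring the filesystem back.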
So, my question is: when there is no client activity, can data in the
cluster still become corrupted and unrecoverable because of network
instability? Or is the cause something else?
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx