Hi Simon,
On 11/15/2019 6:02 PM, Simon Ironside wrote:
Hi Igor,
On 15/11/2019 14:22, Igor Fedotov wrote:
By having SSD DB/WAL, do you mean both a standalone DB and(!!) a
standalone WAL device/partition?
No, 1x combined DB/WAL partition on an SSD and 1x data partition on an
HDD per OSD. I.e. created like:
ceph-deploy osd create --data /dev/sda --block-db ssd0/ceph-db-disk0
ceph-deploy osd create --data /dev/sdb --block-db ssd0/ceph-db-disk1
ceph-deploy osd create --data /dev/sdc --block-db ssd0/ceph-db-disk2
--block-wal wasn't used.
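For what it's worth, the resulting layout can be confirmed by checking the
BlueStore symlinks on the OSD host (the paths below assume the default OSD
data directory and the OSD id is just an example):

ls -l /var/lib/ceph/osd/ceph-0/block*
# block    -> data LV/partition on the HDD
# block.db -> DB LV/partition on the SSD
# no block.wal symlink should exist, since --block-wal wasn't used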
If so then BlueFS might eventually overwrite some data on your DB
volume with BlueFS log content, which most probably makes the OSD crash
and become unable to restart one day. This is a quite random and not
very frequent event which depends to some degree on cluster load. And
the period between the actual data corruption and any evidence of it is
non-zero most of the time - we tend to see it mostly when RocksDB is
performing compaction.
So this, if I've understood you correctly, is for those with 3
separate (DB + WAL + Data) devices per OSD. Not my setup.
right
The other OSD configuration which might suffer from the issue is main
device + WAL device.
The failure probability is much lower for the main + DB layout - it
requires an almost full DB to have any chance of appearing.
This sounds like my setup: 2 separate (DB/WAL combined + Data) devices
per OSD.
yep
Main-only device configurations aren't under threat as far as I
can tell.
And this is for all-in-one devices that aren't at risk. Understood.
While we're waiting for 14.2.5 to be released, what should 14.2.3/4
users with an at risk setup do in the meantime, if anything?
- Check how full their DB devices are?
For your case it makes sense to check this, and then safely wait for
14.2.5 if it's not full.
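One way to check this (just an example - the OSD id is arbitrary and the
command has to be run on the OSD host; the bluefs perf counters report DB
usage in bytes):

ceph daemon osd.0 perf dump | grep -E '"db_(total|used)_bytes"'

ceph osd df also gives a rough per-OSD picture of metadata usage in its
META column.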
- Avoid adding new data/load to the cluster?
This is probably a last resort for when you already start seeing this
issue and are absolutely uncomfortable with the probability of data loss.
It's not a panacea anyway though, as one can already have broken but
still undiscovered data corruption at multiple OSDs.
- Would deep scrubbing detect any undiscovered corruption?
Maybe. We tend to see it during DB compaction (mostly triggered by DB
write access) but IMO it can be detected during scrubbing and/or a store
fsck as well.
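If you want to check proactively, something along these lines should work
(again the OSD id is just an example, and the OSD has to be stopped before
running fsck; a deeper but slower check is available via --deep):

ceph osd deep-scrub 0
systemctl stop ceph-osd@0
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
systemctl start ceph-osd@0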
- Get backups ready to restore? I mean, how bad is this?
As per multiple reports there is some chance of losing OSD data. E.g.
we've got reports of 1-2 OSD failures per day being reproduced under some
stress(!!!) load testing. That's probably not the general case and
production clusters might suffer from this much less frequently. E.g.
across our many QA activities we've observed the issue just once since
it was introduced.
Anyway, it's possible to lose multiple OSDs simultaneously. The
probability is not that large but it's definitely non-zero.
But as the fix is almost ready I'd recommend waiting for it and applying
it ASAP.
Thanks,
Simon.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx