On 02/09/2015 18:16, Mathieu GAUTHIER-LAFAYE wrote:
> Hi Lionel,
>
> ----- Original Message -----
>> From: "Lionel Bouton" <lionel+ceph@xxxxxxxxxxx>
>> To: "Mathieu GAUTHIER-LAFAYE" <mathieu.gauthier-lafaye@xxxxxxxxxxxxx>, ceph-users@xxxxxxxx
>> Sent: Wednesday, 2 September, 2015 4:40:26 PM
>> Subject: Re: Corruption of file systems on RBD images
>>
>> Hi Mathieu,
>>
>> On 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE wrote:
>>> Hi All,
>>>
>>> We regularly have trouble with virtual machines using RBD storage.
>>> When we restart some virtual machines, they start to run filesystem
>>> checks. Sometimes the filesystem can be rescued; sometimes the virtual
>>> machine dies (Linux or Windows).
>> What is the cause of death as reported by the VM? FS inconsistency?
>> Block device access timeout? ...
> The VM starts normally without any error message, but when the OS boots it detects inconsistencies on the filesystem.
>
> It tries to repair it (fsck.ext4 or chkdsk.exe)... A few times the repair was successful and we didn't notice any corruption on the VM, but we did not check the whole filesystem. The solution is often to reinstall the VM.

Hum. Ceph is pretty good at keeping your data safe (with a small caveat, see below), so you might have some other problem causing data corruption. The first thing that comes to mind is that the VM might be running on faulty hardware (corrupting data in memory before it is written to disk).

> [...]
> We have not detected any performance issues due to scrubbing. My doubt was about when it checks the data integrity of a PG on two replicas. Can it take a wrong decision and replace the good data with the bad one? I have probably misunderstood how scrubbing works. Is data safe even if we have only two replicas?

I'm not 100% sure.
With non-checksumming filesystems, if the primary OSD for a PG is corrupted, I believe you are out of luck: AFAIK Ceph doesn't have internal checksums that would allow it to detect corruption when reading data back, so it will give you back whatever the OSD disk has, even if it's corrupted. When repairing a PG (after detecting inconsistencies during a deep scrub), it seems it doesn't try to find the "right" value by vote (i.e. with size=3, it could pick the data on the two "secondary" OSDs when they match each other but don't match the primary, and use it to correct the corruption on the primary) but instead overwrites the secondary OSDs with the data from the primary OSD (which obviously propagates any corruption on the primary to the secondary OSDs).

Then there's a subtlety: with BTRFS and disk corruption, the underlying filesystem returns an I/O error when reading from the primary OSD (because all reads are checked against internal checksums), and I believe Ceph will then switch the read to a secondary OSD to give valid data back to the RBD client. I'm not sure how repairing works in this case: I suspect the data is overwritten with the data from the first OSD where a read doesn't fail, which would correct the situation without any room for an incorrect choice, but the documentation and posts on this subject were not explicit about it.

If I'm right (please wait for confirmation about Ceph's behaviour with Btrfs from the developers), Ceph shouldn't be able to corrupt your VM's data, and the corruption must be happening before the data is stored. That said, there is a small theoretical window where corruption could occur outside the system running the VM: if the data to be written is corrupted on the primary OSD after being received and before being forwarded to the secondary OSDs, Ceph itself could corrupt data (due to flaky hardware on some OSDs).
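To make the difference between the two repair policies concrete, here is a toy sketch (this is NOT Ceph's actual code; `checksum`, `repair_primary_wins` and `repair_by_vote` are hypothetical names, and real repair operates on object metadata, not whole payloads):

```python
import hashlib
from collections import Counter

def checksum(data: bytes) -> str:
    """Content checksum used to compare replica copies (illustrative only)."""
    return hashlib.sha256(data).hexdigest()

def repair_primary_wins(replicas: list[bytes]) -> bytes:
    """The behaviour suspected above: the primary's copy (index 0) is
    pushed to the secondaries, even if the primary holds the bad copy."""
    return replicas[0]

def repair_by_vote(replicas: list[bytes]) -> bytes:
    """The alternative discussed above: keep the copy whose checksum is
    held by a majority of replicas (needs size >= 3 to break ties)."""
    counts = Counter(checksum(r) for r in replicas)
    winner, _ = counts.most_common(1)[0]
    return next(r for r in replicas if checksum(r) == winner)

# size=3, primary (index 0) holds a corrupted copy:
good, bad = b"payload", b"p4yload"
replicas = [bad, good, good]
assert repair_primary_wins(replicas) == bad   # corruption propagates
assert repair_by_vote(replicas) == good       # the vote recovers good data
```

Note that neither policy helps with the write-path window just mentioned: if the primary corrupts the data before forwarding it, all replicas end up consistent but wrong, and no scrub can tell.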
This could be protected against by computing a checksum of the data on the RBD client and checking it on all OSDs before writing to disk, but I don't know the internals/protocols, so I don't know whether this is done and the window closed.

>>> We use BTRFS for OSD with a kernel 3.10. This was not strongly discouraged
>>> when we started the deployment of CEPH last year. Now, it seems that the
>>> kernel version should be 3.14 or later for this kind of setup.
>> See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
>> to upgrade.
>>
>> We have a good deal of experience with Btrfs in production now. We had
>> to disable snapshots, make the journal NoCOW, disable autodefrag and
>> develop our own background defragmenter (which converts to zlib at the
>> same time it defragments, for additional space savings). We currently use
>> kernel version 4.0.5 (we don't use any RAID level, so we don't need 4.0.6
>> to get a fix for an online RAID level conversion bug) and I wouldn't use
>> anything less than 3.19.5. The results are pretty good, but Btrfs is
>> definitely not an out-of-the-box solution for Ceph.
>>
> We did not change any specific options for BTRFS.

On https://btrfs.wiki.kernel.org/index.php/Gotchas there are unspecified problems with snapshot-aware defrag fixed in 3.10.31 (so if you use autodefrag and <3.10.31 you are probably affected). I believe the consequences are performance and space usage problems, but I wouldn't rule out data corruption.

If you can, running kernels >=3.19.5 would rule out most Btrfs-related problems other than Ceph-specific performance ones. Then, if you have Ceph performance problems, you can start to tune it.

In the meantime I would check that there is no trace of hardware problems in the logs on all your hypervisors.

Best regards,

Lionel Bouton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com