On 02/09/2015 18:16, Mathieu GAUTHIER-LAFAYE wrote:
> Hi Lionel,
>
> ----- Original Message -----
>> From: "Lionel Bouton" <lionel+ceph@xxxxxxxxxxx>
>> To: "Mathieu GAUTHIER-LAFAYE" <mathieu.gauthier-lafaye@xxxxxxxxxxxxx>, ceph-users@xxxxxxxx
>> Sent: Wednesday, 2 September, 2015 4:40:26 PM
>> Subject: Re: Corruption of file systems on RBD images
>>
>> Hi Mathieu,
>>
>> On 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE wrote:
>>> Hi All,
>>>
>>> We regularly have trouble with virtual machines using RBD storage.
>>> When we restart some virtual machines, they start to run filesystem
>>> checks. Sometimes the filesystem can be rescued; sometimes the virtual
>>> machine dies (Linux or Windows).
>> What is the cause of death as reported by the VM? FS inconsistency?
>> Block device access timeout? ...
> The VM starts normally without any error message, but when the OS boots it detects inconsistencies on the filesystem.
>
> It tries to repair it (fsck.ext4 or chkdsk.exe)... A few times the repair was successful and we didn't notice any corruption on the VM, but we did not check the whole filesystem. The solution is often to reinstall the VM.

Hum. Ceph is pretty good at keeping your data safe (with a small caveat, see below), so you might have some other problem causing data corruption. The first thing that comes to mind is that the VM might be running on faulty hardware (corrupting data in memory before it is written to disk).

> [...]
> We have not detected any performance issues due to scrubbing. My doubt was about when it checks the data integrity of a PG on two replicas. Can it take a wrong decision and replace the good data with the bad one? I have probably misunderstood how scrubbing works. Is data safe even if we have only two replicas?

I'm not 100% sure.
With non-checksumming filesystems, if the primary OSD for a PG is corrupted, I believe you are out of luck: AFAIK Ceph doesn't have internal checksums that would allow it to detect corruption when reading data back, so it will give you back whatever the OSD disk has, even if it's corrupted. When repairing a PG (after detecting inconsistencies during a deep scrub), it seems it doesn't try to find the "right" value by vote (i.e. with size=3, it could pick the data on the two "secondary" OSDs when they match each other but don't match the primary, and use it to correct the corruption on the primary) but instead overwrites the secondary OSDs with the data from the primary OSD (which obviously propagates any corruption on the primary to the secondary OSDs).

Then there's a subtlety: with BTRFS and disk corruption, the underlying filesystem returns an I/O error when reading from the primary OSD (because all reads are checked against internal checksums), and I believe Ceph will then switch the read to a secondary OSD to give valid data back to the RBD client. I'm not sure how repairing works in this case: I suspect the data is overwritten with the data from the first OSD where a read doesn't fail, which would correct the situation without any room for an incorrect choice, but the documentation and posts on this subject were not explicit about it.

If I'm right (please wait for confirmation about Ceph's behaviour with Btrfs from the developers), Ceph shouldn't be able to corrupt your VM's data, and the corruption must be happening before the data is stored. That said, there is a small theoretical window where corruption could occur outside the system running the VM: if the data to be written is corrupted on the primary OSD after being received and before being forwarded to the secondary OSDs, Ceph itself could corrupt data (due to flaky hardware on some OSDs).
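To make the difference between the two repair policies concrete, here is a toy sketch (this is NOT Ceph's actual code; `checksum`, `repair_primary_wins` and `repair_by_vote` are hypothetical names, and real repair operates on object metadata, not whole payloads):

```python
import hashlib
from collections import Counter

def checksum(data: bytes) -> str:
    """Content checksum used to compare replica copies (illustrative only)."""
    return hashlib.sha256(data).hexdigest()

def repair_primary_wins(replicas: list[bytes]) -> bytes:
    """The behaviour suspected above: the primary's copy (index 0) is
    pushed to the secondaries, even if the primary holds the bad copy."""
    return replicas[0]

def repair_by_vote(replicas: list[bytes]) -> bytes:
    """The alternative discussed above: keep the copy whose checksum is
    held by a majority of replicas (needs size >= 3 to break ties)."""
    counts = Counter(checksum(r) for r in replicas)
    winner, _ = counts.most_common(1)[0]
    return next(r for r in replicas if checksum(r) == winner)

# size=3, primary (index 0) holds a corrupted copy:
good, bad = b"payload", b"p4yload"
replicas = [bad, good, good]
assert repair_primary_wins(replicas) == bad   # corruption propagates
assert repair_by_vote(replicas) == good       # the vote recovers good data
```

Note that neither policy helps with the write-path window just mentioned: if the primary corrupts the data before forwarding it, all replicas end up consistent but wrong, and no scrub can tell.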
This could be protected against by computing a checksum of the data on the RBD client and checking it on all OSDs before writing to disk, but I don't know the internals/protocols, so I don't know whether this is done and the window closed.

>>> We use BTRFS for OSD with a kernel 3.10. This was not strongly discouraged
>>> when we started the deployment of CEPH last year. Now, it seems that the
>>> kernel version should be 3.14 or later for this kind of setup.
>> See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
>> to upgrade.
>>
>> We have a good deal of experience with Btrfs in production now. We had
>> to disable snapshots, make the journal NoCOW, disable autodefrag and
>> develop our own background defragmenter (which converts to zlib at the
>> same time it defragments, for additional space savings). We currently use
>> kernel version 4.0.5 (we don't use any RAID level, so we don't need 4.0.6
>> to get a fix for an online RAID level conversion bug) and I wouldn't use
>> anything less than 3.19.5. The results are pretty good, but Btrfs is
>> definitely not an out-of-the-box solution for Ceph.
>>
> We did not change any specific options for BTRFS.

On https://btrfs.wiki.kernel.org/index.php/Gotchas there are unspecified problems with snapshot-aware defrag fixed in 3.10.31 (so if you use autodefrag and <3.10.31 you are probably affected). I believe the consequences are performance and space usage problems, but I wouldn't rule out data corruption.

If you can, running kernels >=3.19.5 would rule out most Btrfs-related problems other than Ceph-specific performance ones. Then, if you have Ceph performance problems, you can start to tune it.

In the meantime I would check that there is no trace of hardware problems in the logs on all your hypervisors.

Best regards,

Lionel Bouton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com