Re: Corruption of file systems on RBD images

Hi Lionel,

----- Original Message -----
> From: "Lionel Bouton" <lionel+ceph@xxxxxxxxxxx>
> To: "Mathieu GAUTHIER-LAFAYE" <mathieu.gauthier-lafaye@xxxxxxxxxxxxx>, ceph-users@xxxxxxxx
> Sent: Wednesday, 2 September, 2015 4:40:26 PM
> Subject: Re:  Corruption of file systems on RBD images
> 
> Hi Mathieu,
> 
> On 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE wrote:
> > Hi All,
> >
> > We regularly have trouble with virtual machines using RBD storage.
> > When we restart some virtual machines, they start running filesystem
> > checks. Sometimes the filesystem can be rescued, sometimes the virtual
> > machine dies (Linux or Windows).
> 
> What is the cause of death as reported by the VM? FS inconsistency?
> Block device access timeout? ...

The VM starts normally without any error message, but when the OS boots it detects inconsistencies on the filesystem.

It tries to repair them (fsck.ext4 or chkdsk.exe). A few times the repair was successful and we didn't notice any corruption on the VM afterwards, but we did not check the whole filesystem. The solution is often to reinstall the VM.
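
As an aside, in case it is useful while debugging: a minimal sketch (not our actual procedure; pool, image and snapshot names are placeholders) of checking an image's ext4 filesystem non-destructively, by snapshotting it and running a read-only fsck against the mapped snapshot:

#!/usr/bin/env python
# Sketch only: check an image's ext4 filesystem without touching the live VM.
# Pool, image and snapshot names below are placeholders.
import subprocess

POOL, IMAGE, SNAP = "rbd", "vm-disk", "fsck-check"
spec = "%s/%s@%s" % (POOL, IMAGE, SNAP)

def run(*cmd):
    print("+ " + " ".join(cmd))
    return subprocess.check_output(cmd).decode().strip()

run("rbd", "snap", "create", spec)
dev = run("rbd", "map", spec)      # snapshots map read-only, e.g. /dev/rbd0
try:
    # -f forces a check, -n answers "no" everywhere and never writes
    subprocess.call(["fsck.ext4", "-fn", dev])
finally:
    run("rbd", "unmap", dev)
    run("rbd", "snap", "rm", spec)

If the image is partitioned, the check would of course have to target the partition device (e.g. a "p1" suffix) rather than the whole disk.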

> 
> >
> > We moved from Firefly to Hammer last month. I don't know whether the
> > problem is in Ceph and is still there, or whether we are still seeing
> > the symptoms of a Firefly bug.
> >
> > We have two rooms in two separate buildings, so we set the replica size
> > to 2. I wonder whether that can cause this kind of problem during
> > scrubbing operations. I guess the recommended replica size is at least 3.
> 
> Scrubbing is pretty harmless; deep scrubbing is another matter.
> Simultaneous deep scrubs on the same OSD are a performance killer. It
> seems the latest Ceph versions provide some way of limiting their impact
> on performance (scrubs are done per PG, so two simultaneous scrubs can
> and often do involve the same OSD, and I think there's now a limit on
> scrubs per OSD). AFAIK Firefly doesn't have this (and it certainly didn't
> when we were confronted with the problem), so we developed our own deep
> scrub scheduler to avoid involving the same OSD twice (in fact our
> scheduler tries to interleave scrubs so that each OSD gets as much
> inactivity as possible after a deep scrub before the next one). This
> helps a lot.
> 
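
For the archives, a toy illustration of that kind of interleaving (certainly not the scheduler described above; the 'pg dump' JSON field names are assumptions and may differ between releases) might look roughly like this:

#!/usr/bin/env python
# Toy deep-scrub interleaver, sketch only.
# Assumes 'ceph pg dump' exposes 'pgid', 'acting' and
# 'last_deep_scrub_stamp' per PG (field names vary by release).
import json
import subprocess
import time

def pg_stats():
    out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
    return json.loads(out.decode())["pg_stats"]

def one_pass(pause=300):
    # Oldest deep scrub first, one PG at a time, skipping any PG whose
    # acting OSDs overlap with the PG we just scrubbed.
    previous_osds = set()
    for pg in sorted(pg_stats(), key=lambda p: p["last_deep_scrub_stamp"]):
        if previous_osds & set(pg["acting"]):
            continue
        subprocess.check_call(["ceph", "pg", "deep-scrub", pg["pgid"]])
        previous_osds = set(pg["acting"])
        time.sleep(pause)          # let those OSDs rest before the next one

if __name__ == "__main__":
    one_pass()

Newer releases also seem to expose an "osd max scrubs" option (default 1) that caps concurrent scrubs per OSD, which covers part of this.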

We have not detected any performance issues due to scrubbing. My doubt was about how data integrity is checked when a PG has only two replicas: can scrubbing make the wrong decision and replace the good data with the bad copy? I have probably misunderstood how scrubbing works. Is the data safe even with only two replicas?
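
For what it's worth, scrub errors do show up in the health output, so a quick sketch like this can at least surface them for manual inspection before any "ceph pg repair":

#!/usr/bin/env python
# Sketch: surface scrub-detected inconsistencies from 'ceph health detail'.
# Repair is left manual so a human decides which copy to trust first.
import subprocess

health = subprocess.check_output(["ceph", "health", "detail"]).decode()
for line in health.splitlines():
    if "inconsistent" in line:
        print(line.strip())

As far as I understand, "ceph pg repair" on these releases tends to push the primary's copy over the replica, so with size=2 it is worth checking which copy is actually the good one before repairing.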

> >
> > We use BTRFS for the OSDs with kernel 3.10. This was not strongly
> > discouraged when we started the deployment of CEPH last year. Now it
> > seems that the kernel version should be 3.14 or later for this kind
> > of setup.
> 
> See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
> to upgrade.
> 
> We have a good deal of experience with Btrfs in production now. We had
> to disable snapshots, make the journal NoCOW, disable autodefrag and
> develop our own background defragmenter (which converts to zlib at the
> same time it defragments for additional space savings). We currently use
> kernel version 4.0.5 (we don't use any RAID level so we don't need 4.0.6
> to get a fix for an online RAID level conversion bug) and I wouldn't use
> anything less than 3.19.5. The results are pretty good, but Btrfs is
> definitely not an out-of-the-box solution for Ceph.
> 

We did not change any specific options for BTRFS.
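
To make sure I understand what those adjustments involve, I picture something roughly like the sketch below (placeholder filestore paths, certainly not your actual tooling; autodefrag is a mount option and is not shown):

#!/usr/bin/env python
# Sketch of the kind of Btrfs tweaks described above; paths assume the
# standard filestore layout /var/lib/ceph/osd/ceph-<id>/.
import glob
import subprocess

for osd_dir in glob.glob("/var/lib/ceph/osd/ceph-*"):
    # Mark the journal NoCOW. chattr +C only applies to data written after
    # the flag is set, so the journal normally has to be recreated.
    subprocess.call(["chattr", "+C", osd_dir + "/journal"])

    # Periodic defragmentation, recompressing to zlib at the same time.
    subprocess.call(["btrfs", "filesystem", "defragment", "-r", "-czlib",
                     osd_dir + "/current"])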

> 
> >
> > Has anyone already run into similar problems? Do you think it's related
> > to our BTRFS setup? Is it the replica size of the pool?
> 
> It mainly depends on the answer to the first question above (is it a
> corruption or a freezing problem?).

I hope my answer explains the problem more clearly.

> 
> Lionel
> 

Mathieu
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



