Re: Corruption of file systems on RBD images

Hi Mathieu,

On 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE wrote:
> Hi All,
>
> We regularly have trouble with virtual machines using RBD storage. When we restart some virtual machines, they start to run filesystem checks. Sometimes the check can rescue the filesystem, sometimes the virtual machine dies (Linux or Windows).

What is the cause of death as reported by the VM? FS inconsistency?
Block device access timeout? ...

>
> We moved from Firefly to Hammer last month. I don't know if the problem is in Ceph and is still there, or if we are still seeing symptoms of a Firefly bug.
>
> We have two rooms in two separate buildings, so we set the replica size to 2. I'm not sure whether this can cause this kind of problem during scrubbing operations. I guess the recommended replica size is at least 3.

Scrubbing is pretty harmless, deep scrubbing is another matter.
Simultaneous deep scrubs on the same OSD are a performance killer. It
seems the latest Ceph versions provide some way of limiting their impact
on performance (scrubs are scheduled per PG, so two simultaneous scrubs
can and often do involve the same OSD, and I think there's now a limit
on scrubs per OSD). AFAIK Firefly doesn't have this (and it certainly
didn't when we were confronted with the problem), so we developed our
own deep scrub scheduler to avoid involving the same OSD twice (in fact
our scheduler tries to interleave scrubs so that each OSD gets as much
idle time as possible after a deep scrub before the next one). This
helps a lot.
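
To give a rough idea of the approach, here is a minimal Python sketch
(not our actual scheduler; the field names assume the Hammer-era
"ceph pg dump" JSON layout and the pacing is arbitrary):

    #!/usr/bin/env python
    # Sketch: trigger deep scrubs one PG at a time, skipping PGs whose
    # acting OSDs were already hit in this pass. Field names assume the
    # Hammer-era "ceph pg dump" JSON; adjust for other releases.
    import json
    import subprocess
    import time

    def pg_stats():
        out = subprocess.check_output(
            ["ceph", "pg", "dump", "--format", "json"])
        return json.loads(out.decode("utf-8"))["pg_stats"]

    def run_pass():
        busy_osds = set()
        # Oldest deep scrub first, so every PG eventually gets its turn.
        for pg in sorted(pg_stats(),
                         key=lambda p: p["last_deep_scrub_stamp"]):
            osds = set(pg["acting"])
            if osds & busy_osds:
                continue  # one of its OSDs is already used in this pass
            busy_osds |= osds
            subprocess.check_call(["ceph", "pg", "deep-scrub", pg["pgid"]])
            time.sleep(1)  # crude pacing; a real scheduler watches progress

    if __name__ == "__main__":
        run_pass()

A real scheduler also has to spread passes over time so each OSD gets
idle periods between deep scrubs, which is what ours tries to do.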

>
> We use BTRFS for the OSDs with kernel 3.10. This was not strongly discouraged when we started deploying Ceph last year. Now, it seems that the kernel version should be 3.14 or later for this kind of setup.

See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
to upgrade.

We have a good deal of experience with Btrfs in production now. We had
to disable snapshots, make the journal NoCOW, disable autodefrag and
develop our own background defragmenter (which converts files to zlib
compression as it defragments them, for additional space savings). We
currently use kernel version 4.0.5 (we don't use any RAID level, so we
don't need 4.0.6 for the fix to an online RAID level conversion bug)
and I wouldn't use anything less than 3.19.5. The results are pretty
good, but Btrfs is definitely not an out-of-the-box solution for Ceph.
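
For illustration, the background defragmenter boils down to something
like the Python sketch below (the OSD path and the pacing are made-up
values; "btrfs filesystem defragment -czlib" is the stock btrfs-progs
command that recompresses to zlib while defragmenting):

    #!/usr/bin/env python
    # Sketch: walk an OSD data directory and defragment/recompress one
    # file at a time, throttled so it doesn't compete with client I/O.
    # The OSD path below is an assumption; adjust it for your cluster.
    import os
    import subprocess
    import time

    OSD_DATA = "/var/lib/ceph/osd"

    def defragment(path):
        subprocess.call(["btrfs", "filesystem", "defragment",
                         "-czlib", path])

    def background_pass():
        for root, _dirs, files in os.walk(OSD_DATA):
            for name in files:
                defragment(os.path.join(root, name))
                time.sleep(0.1)  # throttle between files

    if __name__ == "__main__":
        background_pass()

This is only a sketch of the principle, not what we actually run.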


>
> Has anyone already run into similar problems? Do you think it's related to our BTRFS setup? Or to the replica size of the pool?

It mainly depends on the answer to the first question above (is it a
corruption or a freezing problem?).

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



