Re: Fwd: Ceph OSD suicide himself

Hi,

On 12/07/2016 02:51, Brad Hubbard wrote:
>  [...]
>>>> This is probably a fragmentation problem: typical rbd access patterns
>>>> cause heavy BTRFS fragmentation.
>>> To the extent that operations take over 120 seconds to complete? Really?
>> Yes, really. I ran into these too. By default Ceph/RBD uses BTRFS in a
>> very aggressive way, rewriting data all over the place and
>> creating/deleting snapshots every filestore sync interval (5 seconds max
>> by default, IIRC).
>>
>> As I said, there are three main causes of performance degradation:
>> - the snapshots,
>> - the journal in a standard copy-on-write file (move it out of the FS or
>> use NoCow),
>> - the weak auto-defragmentation of BTRFS (the autodefrag mount option).
>>
>> Each one of them is enough to impact or even destroy performance in the
>> long run. The three combined make BTRFS unusable by default. This is why
>> BTRFS is not recommended: if you want to use it, you have to be prepared
>> for some (heavy) tuning. The first two points are easy to address; for
>> the last (which becomes noticeable once you accumulate rewrites on your
>> data) I'm not aware of any tool other than the one we developed and
>> published on GitHub (link provided in my previous mail).
>>
>> Another thing: you'd better have a recent 4.1.x or 4.4.x kernel on your
>> OSDs if you use BTRFS. We've used it since 3.19.x, but I wouldn't advise
>> that now: I'd recommend 4.4.x if that's possible for you, and 4.1.x
>> otherwise.
> Thanks for the information. I wasn't aware things were that bad with BTRFS as
> I haven't had much to do with it up to this point.
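
For reference, the first two mitigations quoted above boil down to a
couple of settings. A minimal sketch, assuming a filestore OSD with id 0
and its data in /var/lib/ceph/osd/ceph-0 (paths, ids and values are
illustrative, adapt them to your cluster):

    # ceph.conf: stop filestore from snapshotting the BTRFS subvolume
    # at every sync interval
    [osd]
        filestore btrfs snap = false

    # Recreate the journal as a NoCow file (chattr +C only takes effect
    # on files that are still empty, hence the rm/touch dance):
    systemctl stop ceph-osd@0
    ceph-osd -i 0 --flush-journal
    rm /var/lib/ceph/osd/ceph-0/journal
    touch /var/lib/ceph/osd/ceph-0/journal
    chattr +C /var/lib/ceph/osd/ceph-0/journal
    ceph-osd -i 0 --mkjournal
    systemctl start ceph-osd@0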

Bad is relative. BTRFS was very time-consuming to set up (mainly because
of the defragmentation scheduler development, but finding the sources of
inefficiency was no picnic either), but once used properly it has three
unique advantages:
- data checksums: BTRFS refuses to hand over corrupted data, which forces
Ceph to use a good replica and makes silent data corruption far easier to
handle (some of our RAID controllers, probably damaged by electrical
surges, had the nasty habit of flipping bits, so this was a big time and
data saver for us),
- compression: you get more space for free,
- speed: we get better latencies with it than with XFS.
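
To illustrate the first two points: compression is just a mount option,
and checksums can be verified on the live filesystem with a scrub. The
device and mount point below are purely illustrative:

    # /etc/fstab: transparent compression (lzo is cheap on CPU)
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  btrfs  noatime,compress=lzo  0 0

    # verify all data/metadata checksums in the background; errors are
    # reported in the scrub status and the kernel log
    btrfs scrub start /var/lib/ceph/osd/ceph-0
    btrfs scrub status /var/lib/ceph/osd/ceph-0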

Until bluestore is production-ready (it should address these points even
better than BTRFS does), unless I find a use case where BTRFS falls on
its face, there's no way I'd use anything but BTRFS with Ceph.

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



