Re: HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

On 12/29/14 15:49, Thomas Lemarchand wrote:
> I too dislike the fact that it's not "native" (ie developed inside the
> Linux Kernel), and this is why I'm not sure this project is a good
> solution.
>
> The userbase is necessarily much lower than it would be if this were
> native, so fewer tests, less feedback, and potentially less security.
>
> When I use ZFS on FreeBSD, I know it's widely used and tested.
>
> Since you can have multiple backend FS for your OSD inside a Ceph
> cluster, what I do now is a mix between your alternatives 1 and 2.
>
> XFS for now, and upgrade to BTRFS once it is ready.
>
> On a test cluster (1 MON, 6 OSDs), I started with XFS (for a few
> months), then moved it to BTRFS (without losing a single bit) for a few
> months, then had a problem with BTRFS snapshots (without playing with
> any kind of snapshot in Ceph, weird).

Hi,

Ceph OSDs use BTRFS snapshots automatically. OSDs create and destroy
snapshots at a relatively high rate - according to recent answers on
this list, much higher than what the BTRFS developers expect. This
seems to prevent the BTRFS autodefragmenter from catching up, leading
to heavily fragmented OSDs.
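
If you want to see this in action on a running OSD, listing the
subvolumes on the OSD data partition should show the snapshots the
filestore keeps around (the names below are only what I'd expect to
see, they may differ between Ceph versions):

    # list the BTRFS subvolumes/snapshots under an OSD data dir
    btrfs subvolume list /var/lib/ceph/osd/ceph-0
    # should show a "current" subvolume plus rolling snap_* entries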

I wonder how well BTRFS would work for OSDs if the Ceph devs disabled
snapshots on it. I guess it would prevent the current neat trick on
BTRFS of using a single write for both the journal and the data
directory updates, but we could at least benefit from lzo/zlib
compression, which would help both performance and capacity. This would
probably be a far more stable platform too: all the BTRFS bugs we
encountered were triggered by snapshots and/or the way Ceph uses
snapshots on BTRFS.
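
If I remember correctly this doesn't even need a code change: the
filestore has an option to disable its BTRFS snapshot usage per OSD.
Something along these lines in ceph.conf should be enough to test it
(please double-check the option name against your Ceph version, I'm
quoting it from memory):

    [osd]
        # assumed option name: tell the filestore not to use BTRFS
        # snapshots (it then falls back to the same sync path as on XFS)
        filestore btrfs snap = false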

For people testing BTRFS OSDs, you might want to disable the
autodefragmenter and schedule periodic defragmentations; this more
brute-force approach *might* work much better than relying on the
autodefragmenter's heuristics. According to my last tests with BTRFS
OSDs, performance degrades slowly: on our setup and with our usage
pattern, assuming manual defragmentation solved the fragmentation
problem, launching it once per week would have been more than enough to
keep performance above what XFS provides (with a dedicated journal
partition) on the same hardware.
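
As a concrete sketch of what I mean (paths and schedule are examples
only, and note that filefrag over-reports extents on compressed files,
so treat its numbers as a relative indicator):

    # /etc/cron.d/btrfs-osd-defrag: weekly recursive defragmentation of
    # an OSD data dir that is mounted *without* autodefrag
    0 3 * * 0  root  btrfs filesystem defragment -r /var/lib/ceph/osd/ceph-0

    # quick way to gauge the fragmentation of a given object file
    filefrag /var/lib/ceph/osd/ceph-0/current/<pg dir>/<object file>
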
The last time I checked there were still BTRFS stability bugs in
various kernel versions. The most stable kernel for us (the one where
we couldn't break BTRFS on 10+ OSDs under a moderately high load) was
3.16.4; 3.17.0 and 3.17.1 had a nasty bug which remounted the fs
read-only on occasion.

Currently we have a pure XFS setup but I will probably test this
strategy with additional OSDs the next time we raise our capacity. The
benefits are hard to ignore: journal writes are "free" on BTRFS (I
suppose there is a bit of overhead for creating the snapshots that make
this possible, but it's most probably far less than writing the same
data twice) and lzo works great for us (giving us 20-30% additional
space and most probably a small performance advantage too).
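
For the compression part, it is only a mount option on the OSD data
partition; a hypothetical /etc/fstab entry (device and mount point are
examples only) would look like:

    # OSD data on BTRFS with lzo compression
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  btrfs  noatime,compress=lzo  0 0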

Best regards,

Lionel Bouton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


