Hi,

On 11/04/2016 23:57, Mark Nelson wrote:
> [...]
> To add to this on the performance side, we stopped doing regular
> performance testing on ext4 (and btrfs) some time back, around when ICE
> was released, to focus specifically on filestore behavior on xfs.
> There were some cases at the time where ext4 was faster than xfs, but
> not consistently so. btrfs is often quite fast on a fresh fs, but
> degrades quickly due to fragmentation induced by CoW with
> small-writes-to-large-object workloads (i.e. RBD small writes). If
> btrfs auto-defrag is now safe to use in production it might be worth
> looking at again, but probably not ext4.

For BTRFS, autodefrag is probably not performance-safe (yet), at least with RBD access patterns. It wasn't in 4.1.9, the last kernel we tested it on: performance degraded slowly but surely over several weeks, from an initially well-performing filesystem to the point where we measured a 100% increase in average latencies plus large spikes, and we stopped the experiment. I haven't seen any patches on linux-btrfs since then (autodefrag might have benefited from other modifications, but the algorithm itself wasn't reworked AFAIK). That's not an inherent problem of BTRFS, though, but of the autodefrag implementation.

Deactivating autodefrag and reimplementing a basic, cautious defragmentation scheduler gave us noticeably better latencies with BTRFS vs XFS (~30% better) on the same hardware and workload over the long term (as in almost a year and countless full-disk rewrites on the same filesystems, due to both normal writes and rebalancing, with 3 to 4 months of XFS and BTRFS OSDs coexisting for comparison purposes).

I'll certainly remount a subset of our OSDs with autodefrag, as I did with 4.1.9, when we deploy 4.4.x or a later LTS kernel, so I might have more up-to-date information in the coming months.
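For the curious, a minimal sketch of the "cautious scheduler" idea (this is NOT our actual code; paths, thresholds and pacing are illustrative assumptions): defragment only recently written files, one at a time, with a pause between files to bound the latency impact.

```shell
# Sketch of a cautious defragmentation pass over an OSD data directory.
# Paths and thresholds are assumptions, not our production values.
defrag_pass() {
    osd_dir="$1"       # e.g. /var/lib/ceph/osd/ceph-0/current (assumed layout)
    window_min="$2"    # only touch files modified in the last N minutes
    pause_sec="$3"     # throttle between files to limit latency impact

    find "$osd_dir" -type f -mmin "-$window_min" | while read -r f; do
        if [ -n "$DRY_RUN" ]; then
            echo "would defragment: $f"
        else
            # -t 32M: only rewrite extents smaller than 32M
            btrfs filesystem defragment -t 32M "$f"
            sleep "$pause_sec"
        fi
    done
}

# Example (dry run, no btrfs needed):
#   DRY_RUN=1 defrag_pass /var/lib/ceph/osd/ceph-0/current 10 2
```

A real scheduler would also want rate limiting across passes and some notion of which files are actually fragmented, but the shape is the same.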
I don't plan to compare BTRFS to XFS anymore, though: XFS only saves us from running our defragmentation scheduler, BTRFS is far better suited to our workload, and we've seen constant improvements in behavior along the (arguably bumpy until late 3.19 versions) 3.16.x to 4.1.x road.

Other things:

* If the journal is not on a separate partition (SSD), it should definitely be re-created NoCoW to avoid unnecessary fragmentation. From memory: stop the OSD, touch journal.new, chattr +C journal.new, dd if=journal of=journal.new (your dd options here for best performance / least cache eviction), rm journal, mv journal.new journal, start the OSD again.

* filestore btrfs snap = false is mandatory if you want consistent performance (at least on HDDs). It may not be noticeable with almost empty OSDs, but performance hiccups appear as soon as any non-trivial amount of data is added to the filesystems. IIRC, after debugging, surprisingly it didn't seem to be snapshot creation that caused the performance problems, but snapshot deletion... It's so bad that the default should probably be false, not true.

Lionel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
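And for reference, the setting in question as a ceph.conf fragment (section placement assumed here):

```ini
[osd]
# Disable btrfs snapshot-based filestore syncing; in our case the
# hiccups appeared to come from snapshot deletion, not creation.
filestore btrfs snap = false
```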
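P.S.: the from-memory journal re-creation steps above, written out as a sketch (OSD stop/start omitted; the dd options are an example to tune, not a recommendation):

```shell
# Sketch of the NoCoW journal re-creation. Stop the OSD before running
# this and start it again afterwards.
recreate_journal_nocow() {
    osd_dir="$1"   # directory containing the "journal" file
    (
        cd "$osd_dir" || exit 1
        touch journal.new
        # +C must be set while the file is still empty to take effect;
        # it fails harmlessly on non-btrfs filesystems.
        chattr +C journal.new 2>/dev/null || \
            echo "warning: chattr +C failed (not btrfs?)" >&2
        # pick your own dd options for best perf / least cache eviction
        dd if=journal of=journal.new bs=4M conv=fsync 2>/dev/null
        rm journal
        mv journal.new journal
    )
}

# Usage: recreate_journal_nocow /var/lib/ceph/osd/ceph-0
```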