On 05/06/2015 12:51 PM, Lionel Bouton wrote:
On 05/05/15 02:24, Lionel Bouton wrote:
On 05/04/15 01:34, Sage Weil wrote:
On Mon, 4 May 2015, Lionel Bouton wrote:
Hi, we began testing one Btrfs OSD volume last week and for this first test we disabled autodefrag and began to launch manual btrfs fi defrag. [...]

Cool.. let us know how things look after it ages!

[...] It worked for the past day. Before the algorithm change the Btrfs OSD disk was the slowest on the system compared to the three XFS ones by a large margin. This was confirmed both by iostat %util (often at 90-100%) and by monitoring the disk average read/write latencies over time, which often spiked one order of magnitude above the other disks (as high as 3 seconds). Now the Btrfs OSD disk is at least comparable to the other disks if not a bit faster (comparing latencies). It is still too early to tell, but very encouraging.

Still going well: I added two new OSDs which are behaving correctly too. The first of the two has finished catching up.

There's a big difference in the number of extents on XFS and on Btrfs. I've seen files backing rbd (4MB files with rbd in their names) often have only 1 or 2 extents on XFS. On Btrfs they seem to start at 32 extents when they are created and Btrfs doesn't seem to mind (ie: calling btrfs fi defrag <file> doesn't reduce the number of extents, at least not in the following 30s where it should go down). The extents aren't far from each other on disk though, at least initially.

When my simple algorithm computes the fragmentation cost (the expected overhead of reading a file vs its optimized version), it seems that just after catching up (which takes between 3 hours and 1 day depending on the cluster load and settings), the content is already heavily fragmented (reading a file is expected to take more than 6x the time it would for an optimized version). Then my defragmentation scheduler manages to bring the maximum fragmentation cost (according to its own definition) down to roughly 0.66 of that value (the very first OSD volume is currently sitting at a ~4x cost and occasionally reaches the 3.25-3.5 range).

Is there something that would explain why Btrfs initially creates the 4MB files with 128k extents (32 extents / file)? Is it a bad thing for performance?

During normal operation Btrfs OSD volumes continue to behave in the same way XFS ones do on the same system (sometimes faster/sometimes slower). What is really slow though is the OSD process startup. I have yet to make serious tests (unmounting the filesystems to clear caches), but I've already seen 3 minutes of delay reading the pgs.
Example:

2015-05-05 16:01:24.854504 7f57c518b780 0 osd.17 22428 load_pgs
2015-05-05 16:01:24.936111 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.936137 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671188' got (2) No such file or directory
2015-05-05 16:01:24.991629 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.991654 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671189' got (2) No such file or directory
2015-05-05 16:04:25.413110 7f57c518b780 0 osd.17 22428 load_pgs opened 160 pgs

The filesystem might not have reached its balance between fragmentation and defragmentation rate at this time (so this may change), but this slow startup mirrors our initial experience with Btrfs, where it was the first symptom of bad performance.
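For illustration, here is a minimal sketch of the kind of per-file fragmentation cost estimate described above: the expected sequential read time of the file as laid out on disk, divided by the read time of an ideally contiguous copy, derived from filefrag -v output with a simple seek-plus-throughput disk model. The seek time, throughput and filefrag parsing below are my own assumptions, not the actual scheduler code:

#!/usr/bin/env python3
# Rough sketch of a per-file "fragmentation cost": estimated sequential read
# time of the file as laid out on disk divided by the read time of an ideally
# contiguous copy. Disk model constants and filefrag parsing are assumptions;
# filefrag -v output format also varies slightly between e2fsprogs versions.

import re
import subprocess
import sys

SEEK_MS = 8.0          # assumed average seek + rotational latency (ms)
THROUGHPUT = 120e6     # assumed sequential throughput (bytes/s)
BLOCK = 4096           # ask filefrag to report offsets in 4 KiB blocks

EXTENT_RE = re.compile(
    r'^\s*\d+:\s+(\d+)\.\.\s*(\d+):\s+(\d+)\.\.\s*(\d+):\s+(\d+):')

def extents(path):
    """Return (physical_start, length_in_blocks) pairs from filefrag -v."""
    out = subprocess.check_output(
        ['filefrag', '-v', '-b%d' % BLOCK, path]).decode()
    result = []
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            result.append((int(m.group(3)), int(m.group(5))))
    return result

def read_time_ms(ext_list):
    """Estimated time to read the extents in logical order."""
    total_bytes = sum(length for _, length in ext_list) * BLOCK
    seeks = 1                      # initial seek to the first extent
    prev_end = None
    for phys_start, length in ext_list:
        if prev_end is not None and phys_start != prev_end:
            seeks += 1             # discontiguity: charge one more seek
        prev_end = phys_start + length
    return seeks * SEEK_MS + total_bytes / THROUGHPUT * 1000.0

def fragmentation_cost(path):
    ext_list = extents(path)
    if not ext_list:
        return 1.0
    total_bytes = sum(length for _, length in ext_list) * BLOCK
    ideal_ms = SEEK_MS + total_bytes / THROUGHPUT * 1000.0
    return read_time_ms(ext_list) / ideal_ms

if __name__ == '__main__':
    for path in sys.argv[1:]:
        print('%6.2fx  %s' % (fragmentation_cost(path), path))

Note that this crude model charges a full seek for every discontiguity, so nearly-adjacent extents (like the initial 32 x 128k layout described above) get penalized more than they probably deserve; a more realistic cost function would presumably weight short jumps less.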
Out of curiosity, do you see excessive memory usage during defragmentation? Last time I spoke to Josef it sounded like it wasn't particularly safe yet and could make the machine go OOM, especially if there are lots of snapshots.
I've also included some test results from Emperor (i.e. quite old now) showcasing how sequential read performance degrades on btrfs after random writes are performed (on the 2nd tab you can see how even writes are affected as well). Basically the first iteration of tests looks great up until random writes are done, which cause excessive fragmentation due to COW; subsequent tests are then quite bad compared to the initial BTRFS tests (and to XFS).
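For anyone wanting to reproduce that effect on a small scale, below is a minimal sketch (not the harness behind the attached spreadsheet) that writes a file sequentially on a btrfs mount, overwrites random 4 KiB blocks in place so COW creates new extents, and compares cold-cache sequential read time and extent count before and after. The path, file size, overwrite ratio and use of posix_fadvise to drop the page cache are assumptions:

#!/usr/bin/env python3
# Rough sketch: sequential write, then random in-place overwrites (which btrfs
# turns into new COW extents), comparing extent count and cold-cache
# sequential read time before and after. Path, sizes and overwrite ratio are
# assumptions; run it on a btrfs mount with enough free space.

import os
import random
import subprocess
import time

PATH = '/var/lib/test/cow_frag_test.dat'   # hypothetical path on a btrfs mount
SIZE = 256 * 1024 * 1024                   # 256 MiB test file
BLOCK = 4096

def seq_read_seconds(path):
    """Drop the file's cached (clean) pages, then time a sequential read."""
    with open(path, 'rb') as f:
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.time()
        while f.read(1024 * 1024):
            pass
        return time.time() - start

def extent_count(path):
    """Parse 'path: N extents found' from plain filefrag output."""
    out = subprocess.check_output(['filefrag', path]).decode()
    return int(out.split(':')[-1].split()[0])

# 1. Sequential write (incompressible data, so compression doesn't skew it).
with open(PATH, 'wb') as f:
    for _ in range(SIZE // BLOCK):
        f.write(os.urandom(BLOCK))
    f.flush()
    os.fsync(f.fileno())

print('after sequential write:  %d extents, read %.2fs'
      % (extent_count(PATH), seq_read_seconds(PATH)))

# 2. Random in-place overwrites -> COW scatters the file into small extents.
with open(PATH, 'r+b') as f:
    for _ in range(SIZE // BLOCK // 4):        # rewrite ~25% of the blocks
        f.seek(random.randrange(SIZE // BLOCK) * BLOCK)
        f.write(os.urandom(BLOCK))
    f.flush()
    os.fsync(f.fileno())

print('after random overwrites: %d extents, read %.2fs'
      % (extent_count(PATH), seq_read_seconds(PATH)))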
Your testing is thus quite interesting, especially if it means we can reduce this effect. Keep it up!
Mark
Best regards,
Lionel
Attachment:
Emeror Raw Performance Data.ods
Description: application/vnd.oasis.opendocument.spreadsheet