On Mon, 4 May 2015, Lionel Bouton wrote:
> Hi,
>
> we began testing one Btrfs OSD volume last week, and for this first
> test we disabled autodefrag and began to launch manual btrfs fi defrag.
>
> During the tests, I monitored the number of extents of the journal
> (10GB) and it went through the roof (it currently sits at 8000+
> extents, for example). I was tempted to defragment it, but after
> thinking a bit about it I think it might not be a good idea.
>
> With Btrfs, by default the data written to the journal on disk isn't
> copied to its final destination. Ceph is using a clone_range feature
> to reference the same data instead of copying it.

We've discussed this possibility but have never implemented it. The
data is written twice: once to the journal and once to the object file.

> So if you defragment both the journal and the final destination, you
> are moving the data around to attempt to get both references to
> satisfy a one-extent goal, but most of the time you can't get both of
> them at the same time (unless the destination is a whole file instead
> of a fragment of one).
>
> I assume the journal probably doesn't benefit at all from
> defragmentation: it's overwritten constantly, and as Btrfs uses CoW,
> the previous extents won't be reused at all; new ones will be created
> for the new data instead of overwriting the old in place. The final
> destination files are reused (reread) and do benefit from
> defragmentation.

Yeah, I agree. It is probably best to let btrfs write the journal
anywhere, since it is never read (except for replay after a failure or
restart).

There is also a newish 'journal discard' option that is false by
default; enabling it may let us throw out the previously allocated
space so that new writes go to fresh locations (instead of to the
previously written and fragmented positions). I expect this will make
a positive difference, but I'm not sure that anyone has tested it. (A
sketch of the relevant ceph.conf fragment is at the end of this
message.)

> Under these assumptions we excluded the journal file from
> defragmentation; in fact we only defragment the "current" directory
> (snapshot directories are probably only read from in rare cases and
> are ephemeral, so optimizing them is not interesting).
>
> The filesystem is only one week old, so we will have to wait a bit to
> see if this strategy is better than the one used when mounting with
> autodefrag (I couldn't find much about it, but last year we had
> unmanageable latencies).

Cool.. let us know how things look after it ages!

sage

> We have a small Ruby script which triggers defragmentation based on
> the number of extents and by default limits the rate of calls to
> btrfs fi defrag to a negligible level to avoid thrashing the
> filesystem. If someone is interested I can attach it or push it to
> GitHub after a bit of cleanup.
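
To make the approach concrete, here is a minimal, untested sketch of
what such a script might look like. This is an illustration, not the
script mentioned above: the OSD path, the extent threshold, and the
pause length are all hypothetical, and it assumes filefrag (from
e2fsprogs) and btrfs-progs are installed. Like the setup described
above, it only walks the "current" directory, so the journal is never
touched:

  #!/usr/bin/env ruby
  # Illustrative sketch only -- not the script mentioned above. Assumes
  # filefrag (e2fsprogs) and btrfs-progs are available and that it runs
  # with enough privilege to defragment the OSD's files.
  require 'find'
  require 'shellwords'

  OSD_CURRENT  = '/var/lib/ceph/osd/ceph-0/current'  # hypothetical path
  MAX_EXTENTS  = 32   # defragment files fragmented beyond this
  DEFRAG_PAUSE = 5    # seconds between defrag calls (rate limit)

  # Parse "... N extents found" from filefrag's output; a failed run
  # yields no match, hence 0 extents, hence the file is skipped.
  def extent_count(path)
    out = `filefrag #{Shellwords.escape(path)} 2>/dev/null`
    out[/(\d+) extents? found/, 1].to_i
  end

  # Only walk "current"; the journal lives elsewhere and is skipped.
  Find.find(OSD_CURRENT) do |path|
    next unless File.file?(path)
    next if extent_count(path) <= MAX_EXTENTS
    system('btrfs', 'filesystem', 'defragment', path)
    sleep DEFRAG_PAUSE  # keep the defrag rate negligible
  end

The sleep after each defrag call is the crude rate limiter: it bounds
the defragmentation load regardless of how many files are fragmented.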
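
And regarding the 'journal discard' option mentioned above, enabling
it should be a matter of something like the following in ceph.conf (a
sketch, assuming the option is set OSD-wide in the [osd] section and
the OSD is restarted afterwards; as noted, its effect on fragmentation
is untested):

  [osd]
      ; untested: discard journal space after it is consumed, so new
      ; journal writes can land in freshly allocated extents
      journal discard = true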