Re: Btrfs defragmentation

On 05/04/15 01:34, Sage Weil wrote:
> On Mon, 4 May 2015, Lionel Bouton wrote:
>> Hi,
>>
>> we began testing one Btrfs OSD volume last week and for this first test
>> we disabled autodefrag and began to launch manual btrfs fi defrag.
>> [...]
> Cool.. let us know how things look after it ages!

We had the first signs of Btrfs aging yesterday morning. Latencies
went up noticeably. The journal was at ~3000 extents, down from a maximum
of ~13000 the day before. To verify my assumption that journal
fragmentation was not the cause of the latencies, I defragmented it. It
took more than 7 minutes (10GB journal), left the journal at ~2300 extents
(probably because it was heavily used during the defragmentation), and
didn't solve the high latencies at all.
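
For reference, tracking such extent counts only needs a plain filefrag
call; the path below is an example, not necessarily where the journal
actually lives:

# Illustration only: filefrag without -v just reports the extent count.
puts `filefrag /var/lib/ceph/osd/ceph-0/journal`
# e.g. "/var/lib/ceph/osd/ceph-0/journal: 2300 extents found"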

The initial algorithm selected files to defragment based solely on their
number of extents (files with more extents were processed first). This
was a simple approach to the problem that I hoped would be enough; it
clearly wasn't, so I had to make it more clever.

filefrag -v conveniently outputs each fragment's position on the
device as well as the total file size. So I changed the algorithm: it
still uses the result of a periodic find | xargs filefrag call (which
is relatively cheap and ends up fitting in a <100MB Ruby process) but
now models the fragmentation cost better.
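
For illustration, the parsing can be done along these lines in Ruby. This
is a simplified sketch, not the actual script; the regexps assume the
usual e2fsprogs filefrag -v output format and may need adjusting:

# Turn `filefrag -v <path>` output into the file size and the list of
# physical extents (positions and lengths in filesystem blocks).
Extent = Struct.new(:physical_start, :length)

def parse_filefrag(path)
  size_bytes = nil
  block_size = 4096
  extents    = []
  IO.popen(["filefrag", "-v", path]) do |io|
    io.each_line do |line|
      case line
      when /^File size of .* is (\d+) \((\d+) blocks? of (\d+) bytes\)/
        size_bytes = $1.to_i
        block_size = $3.to_i
      when /^\s*\d+:\s+\d+\.\.\s*\d+:\s+(\d+)\.\.\s*(\d+):\s+(\d+):/
        extents << Extent.new($1.to_i, $3.to_i)
      end
    end
  end
  { size: size_bytes, block_size: block_size, extents: extents }
end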

The new one computes the total cost of reading every file, counting an
initial seek, the total time based on the sequential read speed, and the
time associated with each seek from one extent to the next (which can be
0 when Btrfs managed to put an extent right after the previous one, or
very small if it is not far from it on the same HDD track). This total
cost is compared with the ideal defragmented case to estimate the speedup
defragmentation could bring. Finally the result is normalized by dividing
it by the file's total size.

The normalization is done because, in the case of RBD (and probably most
other uses), what matters is how long a 128kB or 1MB read would take
whatever the file and the offset within it, not how long reading a whole
file would take (there's an assumption that each file has the same
probability of being read, which might need to be revisited). There are
approximations in the cost computation and it's HDD-centric, but it's not
very far from reality.
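
A simplified sketch of such a cost model follows. The HDD figures (seek
times, transfer rate, "near" threshold) and the exact comparison and
normalization are illustrative assumptions, not the actual values and
formula used in the script:

SEEK_FULL  = 0.008              # average seek time, seconds (assumption)
SEEK_NEAR  = 0.001              # short seek / rotational delay for nearby extents
NEAR_BYTES = 8 * 1024 * 1024    # gap below which a seek counts as "near"
READ_BPS   = 120_000_000.0      # sequential read speed, bytes/second

def read_cost(size_bytes, block_size, extents)
  cost = SEEK_FULL + size_bytes / READ_BPS        # initial seek + sequential read
  extents.each_cons(2) do |a, b|
    gap = (b.physical_start - (a.physical_start + a.length)) * block_size
    cost += if gap == 0                           # extent placed right after the previous one
              0.0
            elsif gap.abs < NEAR_BYTES            # close by: cheap seek
              SEEK_NEAR
            else                                  # full seek
              SEEK_FULL
            end
  end
  cost
end

# Normalized score: estimated time per byte read that defragmenting the
# file could save (ideal case = one seek + one sequential read).
def fragmentation_score(size_bytes, block_size, extents)
  ideal = SEEK_FULL + size_bytes / READ_BPS
  (read_cost(size_bytes, block_size, extents) - ideal) / size_bytes
end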

The idea was that it would find the files where fragmentation is the
most painful more quickly, instead of wasting time on less interesting
files. This should make the defragmentation more efficient even if it
doesn't process as many files (the less defragmentation takes place, the
less load we add).
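
Conceptually the selection then boils down to something like this, using
the helpers sketched above. Paths and the batch size are examples, and
the real script works from a single periodic find | xargs filefrag run
rather than calling filefrag once per file:

# Score the files under an OSD data directory (example path) and
# defragment the worst offenders first.
osd_dir = "/var/lib/ceph/osd/ceph-0/current"

scored = Dir.glob(File.join(osd_dir, "**", "*"))
            .select { |p| File.file?(p) }
            .map    { |p| [p, parse_filefrag(p)] }
            .reject { |_p, info| info[:size].nil? || info[:extents].size <= 1 }
            .map    { |p, info| [p, fragmentation_score(info[:size], info[:block_size], info[:extents])] }
            .sort_by { |_p, score| -score }       # worst first

scored.first(10).each do |path, _score|
  system("btrfs", "filesystem", "defragment", path)
end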

It has worked for the past day. Before the algorithm change, the Btrfs OSD
disk was the slowest on the system, slower than the three XFS ones by a
large margin. This was confirmed both by iostat %util (often at 90-100%)
and by monitoring the disk's average read/write latencies over time, which
often spiked an order of magnitude above the other disks' (as high as 3
seconds). Now the Btrfs OSD disk is at least comparable to the other
disks, if not a bit faster (comparing latencies).

It is still too early to tell, but this is very encouraging.

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



