Re: Btrfs defragmentation

On 05/06/15 20:28, Lionel Bouton wrote:
> Hi,
>
> On 05/06/15 20:07, Timofey Titovets wrote:
>> 2015-05-06 20:51 GMT+03:00 Lionel Bouton <lionel+ceph@xxxxxxxxxxx>:
>>> Is there something that would explain why initially Btrfs creates the
>>> 4MB files with 128k extents (32 extents / file) ? Is it a bad thing for
>>> performance ?
>> This kind of behaviour is the reason why I asked you about compression:
>> "You can use filefrag to locate heavily fragmented files (may not work
>> correctly with compression)."
>> https://btrfs.wiki.kernel.org/index.php/Gotchas
>>
>> filefrag shows each compressed chunk as a separate extent, but those
>> extents can still be laid out linearly. This is a problem in filefrag =\
> Hmm, I see. This could explain why we rarely see the number of extents
> go down: when compressed data is replaced with incompressible data,
> Btrfs must disable compression and can then reduce the number of
> extents.
>
> This shouldn't have much impact on the defragmentation process or on
> performance: we check for extents written sequentially next to each
> other and don't count them as a cost for file access. This is why
> these files aren't defragmented even when we ask for it, and why our
> tool reports a low overhead for them.

Here's more information, especially about compression.

1/ filefrag behaviour.

I used our tool to trace how fragmentation evolves after launching
btrfs fi defrag on each file (it calls filefrag -v asynchronously every
5 seconds until the defragmentation seems done).
filefrag's output doesn't account for compression and doesn't seem to
reflect the latest on-disk layout.
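
For illustration, here's a rough sketch of that polling loop (the
function names, the comparison rule and the time cap are made up for
the example, they are not taken from our actual tool):

# Simplified sketch of the polling described above: re-run "filefrag -v"
# every 5 seconds until two consecutive outputs are identical, i.e. the
# layout visible to filefrag has stopped changing.
import subprocess
import time

def filefrag_v(path):
    return subprocess.check_output(["filefrag", "-v", path]).decode()

def wait_for_stable_layout(path, interval=5, max_wait=120):
    previous = filefrag_v(path)
    waited = 0
    while waited < max_wait:
        time.sleep(interval)
        waited += interval
        current = filefrag_v(path)
        if current == previous:
            return current      # layout looks stable (probably committed)
        previous = current
    return previous             # give up after max_wait seconds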

- for compression, the reported layout quite often shows an extent that
begins in the middle of the previous one. So I assume the physical
offset of the extent's start is correct but its end is computed from
the extent's decompressed length (it's always 32 x 4096-byte blocks,
which matches the compression block size). We had to compensate for
that because we erroneously counted this case as needing a seek
although it doesn't (see the sketch after this list). This means you
can't trust the number of extents reported by filefrag -v (it is
supposed to merge consecutive extents when run with -v).
- for access to the layout, I assume Btrfs reports what is committed to
disk. I base this assumption on the fact that for all defragmented
files, filefrag -v output becomes stable in at most 30 seconds after the
"btrfs fi defrag" command returns (30 seconds is the default commit
interval for Btrfs).
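
Here's the kind of compensation we apply, as a rough sketch (the
function name is mine for the example; the real tool works on parsed
filefrag -v output, but the rule is the same: an extent starting inside
or right after the previous one's reported range doesn't count as a
seek):

# Rough sketch of the compensation described in the first point above:
# with compression, the reported end of an extent can overshoot (it is
# derived from the decompressed length), so the next extent may appear
# to start "inside" the previous one even though the data is physically
# contiguous. Only count a seek for a backward jump or a real gap.
def count_seeks(extents):
    """extents: list of (physical_start, physical_end) block ranges,
    in the order filefrag -v reports them (ends inclusive)."""
    seeks = 0
    for (prev_start, prev_end), (start, _end) in zip(extents, extents[1:]):
        backward = start < prev_start   # extent placed before the previous one
        gap = start > prev_end + 1      # hole between the two extents
        if backward or gap:
            seeks += 1
    return seeks

With this rule, a compressed file whose 128k extents were written back
to back counts zero seeks, which matches the low overhead our tool
reports for these files.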

There's also something odd going on with the 'shared' flag reported by
filefrag: I assumed it was linked to clone_range or snapshots, and most
of the time that seems to be the case, but on other (non-OSD)
filesystems I found files with this flag set on some extents and
couldn't find any explanation for it.

2/ compression influence on fragmentation

Even after compensating for filefrag -v errors, Btrfs clearly has more
difficulty defragmenting compressed files. At least our model for
computing the cost associated with a particular layout reports smaller
gains when defragmenting a compressed file. In our configuration, and
according to our model of disk latencies, we seem to hit a limit where
file reads cost ~2.75x what they would if the files were in an ideal,
sequential layout. If we try to go lower, the majority of the files
don't benefit at all from defragmentation (the resulting layout isn't
better than the initial one).
Note that this doesn't account for NCQ/TCQ: we assume each read is
isolated. In practice, reading from multiple threads should be less
costly, so the OSD might not suffer much from this.
In fact, 2 out of the 3 Btrfs OSDs have lower latencies than most of
the rest of the cluster, even with our tool slowly checking files and
triggering defragmentations in the background.
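
To give an idea of what the 2.75x figure means, here's a rough sketch
of such an overhead computation with a deliberately simple
single-request disk model (the seek time and transfer rate constants
are illustrative; our actual model is more detailed):

# Rough sketch of the overhead ratio discussed above: estimated time to
# read the whole file with its current layout, divided by the time to
# read it if it were laid out sequentially. Constants are illustrative.
AVG_SEEK_S = 0.008        # ~8 ms average seek for a 7200 rpm disk
TRANSFER_BPS = 120e6      # ~120 MB/s sequential throughput
BLOCK = 4096              # filefrag physical units are 4 KiB blocks

def read_cost(extents):
    """Estimated read time for (physical_start, physical_end, length)
    extents, counting seeks with the same rule as the earlier sketch."""
    total_bytes = sum(length for _s, _e, length in extents) * BLOCK
    seeks = 0
    for (ps, pe, _pl), (s, _e, _l) in zip(extents, extents[1:]):
        if s < ps or s > pe + 1:      # backward jump or gap => one seek
            seeks += 1
    return total_bytes / TRANSFER_BPS + seeks * AVG_SEEK_S

def overhead(extents):
    """Read cost relative to an ideal, fully sequential layout."""
    total_bytes = sum(length for _s, _e, length in extents) * BLOCK
    ideal = total_bytes / TRANSFER_BPS
    return read_cost(extents) / ideal

In these terms, 2.75 is roughly the overhead value below which asking
Btrfs to defragment a compressed file usually doesn't improve its
layout any further.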

3/ History/heavy backfilling seems to have a large influence on performance

As I said, 2 out of our 3 Btrfs OSDs behave very well. Unfortunately
the third doesn't. This is the OSD where our tool was deactivated
during most of the initial backfilling process. It doesn't have the
most data, the most writes or the most reads of the group, but it had
by far the worst latencies over the last two days. I even checked the
disk for hardware problems and couldn't find any.
I don't have a clear explanation for the performance difference. Maybe
the 2.75x overhead target isn't low enough and, below this target,
this OSD has more fragmented files than the others (we don't compute
the average fragmentation yet). This would mean that we can expect the
performance of the 2 others to slowly degrade over time (so the test
isn't conclusive yet).

I've decided to remount this particular OSD without compression and
let our tool slowly bring the maximum overhead down to 1.5x (which
should be doable since, without compression, files are more easily
defragmented) while setting its primary-affinity to 0. I'll revert to
primary-affinity 1 when the defragmentation is done and see how the
OSD/disk behaves.
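
For reference, a minimal sketch of the operations involved, assuming
an OSD id of 42 and the usual mountpoint layout (both illustrative).
On kernels where compress=no isn't accepted as a mount option,
dropping the compress option from fstab and remounting achieves the
same result:

# Minimal sketch, with illustrative OSD id and mountpoint. Older Ceph
# releases may also need "mon osd allow primary affinity = true".
import subprocess

OSD_ID = "42"
MOUNTPOINT = "/var/lib/ceph/osd/ceph-" + OSD_ID

# Make this OSD unlikely to be chosen as primary while it is defragmented.
subprocess.check_call(["ceph", "osd", "primary-affinity", "osd." + OSD_ID, "0"])

# Remount the OSD filesystem with compression disabled.
subprocess.check_call(["mount", "-o", "remount,compress=no", MOUNTPOINT])

# Later, once the 1.5x overhead target is reached:
# subprocess.check_call(["ceph", "osd", "primary-affinity", "osd." + OSD_ID, "1"])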

4/ autodefrag doesn't like Ceph OSDs

According to our previous experience, by now all our Btrfs OSDs should
be on their knees begging us to shoot them down: there's clearly
something to gain by tuning the defragmentation process. I suspect
that autodefrag either takes too much time trying to defragment the
journal and/or is overwhelmed by the amount of fragmentation going on
and skips defragmentations randomly instead of focusing on the most
fragmented files.

Best regards,

Lionel



