On 05/06/15 20:28, Lionel Bouton wrote:
> Hi,
>
> On 05/06/15 20:07, Timofey Titovets wrote:
>> 2015-05-06 20:51 GMT+03:00 Lionel Bouton <lionel+ceph@xxxxxxxxxxx>:
>>> Is there something that would explain why initially Btrfs creates the
>>> 4MB files with 128k extents (32 extents / file)? Is it a bad thing for
>>> performance?
>> This kind of behaviour is the reason why I asked you about compression.
>> "You can use filefrag to locate heavily fragmented files (may not work
>> correctly with compression)."
>> https://btrfs.wiki.kernel.org/index.php/Gotchas
>>
>> filefrag shows each compressed chunk as a separate extent, but they can
>> be laid out linearly. This is a problem in filefrag =\
> Hum, I see. This could explain why we rarely see the number of extents
> go down. When data is replaced with incompressible data, Btrfs must
> deactivate compression and is then able to reduce the number of extents.
>
> This should not have much impact on the defragmentation process and
> performance: we check for extents being written sequentially next to
> each other and don't count this as a cost for file access. This is why
> these files aren't defragmented even if we ask for it and our tool
> reports a low overhead for them.

Here's more information, especially about compression.

1/ filefrag behaviour

I use our tool to trace the fragmentation evolution after launching
btrfs fi defrag on each file (it calls filefrag -v asynchronously every
5 seconds until the defragmentation seems done). filefrag output doesn't
understand compression and doesn't seem to have access to the latest
on-disk layout.

- For compression, you pretty often get a reported layout where an
  extent begins in the middle of the previous one. So I assume the
  physical offset of the extent start is correct but the end is computed
  from the extent's decompressed length (it's always 32 x 4096-byte
  blocks, which matches the compression block size). We had to
  compensate for that because we erroneously considered this case as
  needing a seek although it doesn't. This means you can't trust the
  number of extents reported by filefrag -v (it is supposed to merge
  consecutive extents when run with -v).

- For access to the layout, I assume Btrfs reports what is committed to
  disk. I base this assumption on the fact that for all defragmented
  files, the filefrag -v output becomes stable at most 30 seconds after
  the "btrfs fi defrag" command returns (30 seconds is the default
  commit interval for Btrfs).

There's something odd going on with the 'shared' flag reported by
filefrag too: I assumed it was linked to clone_range or snapshots, and
most of the time it seems so, but on other (non-OSD) filesystems I found
files with this flag on extents and I couldn't find any explanation for
it.

2/ compression influence on fragmentation

Even after compensating for filefrag -v errors, Btrfs clearly has more
difficulty defragmenting compressed files. At least our model for
computing the cost associated with a particular layout reports smaller
gains when defragmenting a compressed file.

In our configuration, and according to our model of disk latencies, we
seem to hit a limit where file reads cost ~2.75x what they would if the
files were in an ideal, sequential layout. If we try to go lower, the
majority of the files don't benefit at all from defragmentation (the
resulting layout isn't better than the initial one). Note that this
doesn't account for NCQ/TCQ: we assume the read is isolated.
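To make this concrete, here is a minimal, hypothetical sketch of the two
ideas above: merging the extents reported by filefrag -v while
compensating for compression, and estimating the read overhead of a
layout against an ideal sequential read. The constants and names are
illustrative assumptions, not the actual code of our tool or anything
from Btrfs/Ceph:

    # Hypothetical sketch, not our actual tool: compensate for compressed
    # extents reported by filefrag -v and estimate the read overhead of a
    # layout versus an ideal sequential read.

    SEEK_MS = 8.0              # assumed average seek + rotational latency
    TRANSFER_MS_PER_MB = 10.0  # assumed sequential throughput (~100 MB/s)

    def merge_extents(extents):
        """Merge (physical_start, physical_end) block ranges in file order.

        With compression, filefrag computes an extent's end from its
        decompressed length, so the next extent can start "inside" the
        previous one even though the data is laid out sequentially; we
        treat that case as contiguous and don't count a seek for it."""
        merged = []
        for start, end in extents:
            if merged and start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    def read_overhead(extents, file_bytes):
        """Cost of one isolated read relative to a perfectly sequential file."""
        seq_ms = (file_bytes / (1024 * 1024)) * TRANSFER_MS_PER_MB
        seeks = max(len(merge_extents(extents)) - 1, 0)
        return (seq_ms + seeks * SEEK_MS) / seq_ms

    # Example: a 4 MB object file reported as 3 extents, the second one
    # starting inside the first (typical of compression).
    layout = [(1000, 1031), (1016, 1047), (5000, 5031)]
    print(read_overhead(layout, 4 * 1024 * 1024))  # ~1.2x with these numbers

With NCQ/TCQ and several concurrent reads, the real penalty per
discontiguity should be lower than the fixed per-seek cost assumed here.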
So in practice, reading from multiple threads should be less costly and
the OSD might not suffer much from this. In fact 2 out of the 3 Btrfs
OSDs have lower latencies than most of the rest of the cluster, even
with our tool slowly checking files and triggering defragmentations in
the background.

3/ History/heavy backfilling seems to have a large influence on performance

As I said, 2 out of our 3 Btrfs OSDs behave very well. Unfortunately the
third doesn't. This is the OSD where our tool was deactivated during
most of the initial backfilling process. It doesn't have the most data,
the most writes or the most reads of the group, but it had by far the
worst latencies these last two days. I even checked the disk for
hardware problems and couldn't find any.

I don't have a clear explanation for the performance difference. Maybe
the 2.75x overhead target isn't low enough and this OSD has more
fragmented files than the others below this target (we don't compute the
average fragmentation yet). This would mean that we can expect the
performance of the 2 others to slowly degrade over time (so the test
isn't conclusive yet).

I've decided to remount this particular OSD without compression and let
our tool slowly bring the maximum overhead down to 1.5x (which should be
doable as, without compression, files are more easily defragmented)
while using primary-affinity = 0. I'll revert to primary-affinity 1 when
the defragmentation is done and see how the OSD/disk behaves.

4/ autodefrag doesn't like Ceph OSDs

According to our previous experience, by now all our Btrfs OSDs should
be on their knees begging us to shoot them down: there's clearly
something to gain by tuning the defragmentation process. I suspect that
autodefrag either takes too much time trying to defragment the journal
and/or is overwhelmed by the amount of fragmentation going on and skips
defragmentations randomly instead of focusing on the most fragmented
files.

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com