On 05/04/15 01:34, Sage Weil wrote:
> On Mon, 4 May 2015, Lionel Bouton wrote:
>> Hi,
>>
>> we began testing one Btrfs OSD volume last week and for this first test
>> we disabled autodefrag and began to launch manual btrfs fi defrag.
>> [...]
> Cool.. let us know how things look after it ages!

We had the first signs of Btrfs aging yesterday morning. Latencies went up noticeably. The journal was at ~3000 extents, down from a maximum of ~13000 the day before. To verify my assumption that journal fragmentation was not the cause of the latencies, I defragmented it. It took more than 7 minutes (10GB journal), left it at ~2300 extents (probably because it was heavily used during the defragmentation), and it didn't solve the high latencies at all.

The initial algorithm selected files to defragment based solely on the number of extents (files with more extents were processed first). This was a simple approach to the problem that I had hoped would be enough; it wasn't, so I had to make it cleverer.

filefrag -v conveniently outputs each fragment's physical position on the device as well as the total file size. So I changed the algorithm: it still uses the result of a periodic find | xargs filefrag call (which is relatively cheap and fits in a <100MB Ruby process) but models the fragmentation cost more accurately. The new version computes the total cost of reading each file: an initial seek, the time to read the data at sequential read speed, and the time of each seek from one extent to the next (which can be 0 when Btrfs managed to put an extent just after the previous one, or very small when it isn't far from it, e.g. on the same HDD track). This total cost is compared with the ideal, fully defragmented case to estimate the speedup defragmentation could bring. Finally the result is normalized by dividing it by the file's total size (a rough sketch of this is appended at the end of this mail). The normalization is done because in the case of RBD (and probably most other uses) what matters is how long a 128kB or 1MB read would take, whatever the file and the offset within it, not how long reading a whole file would take (this assumes each file has the same probability of being read, which might need to be revisited). There are approximations in the cost computation and it's HDD-centric, but it's not very far from reality.

The idea was that it would find the files where fragmentation is the most painful more quickly, instead of wasting time on less interesting files. This makes defragmentation more efficient even if it doesn't process as many files (the less defragmentation takes place, the less load we add).

It has worked for the past day. Before the algorithm change, the Btrfs OSD disk was by a large margin the slowest on the system compared to the three XFS ones. This was confirmed both by iostat %util (often at 90-100%) and by monitoring the disk's average read/write latencies over time, which often spiked an order of magnitude above the other disks (as high as 3 seconds). Now the Btrfs OSD disk is at least comparable to the other disks, if not a bit faster (comparing latencies). It's still too early to tell, but this is very encouraging.

Best regards,

Lionel
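
PS, for anyone curious, here is a rough Ruby sketch of the extent-gathering part. It is not the actual script: the names (Extent, FileFrag, parse_filefrag, scan), the OSD directory argument and the batch size are made up for the sketch, and the parsing assumes the filefrag -v output format of the e2fsprogs version I have here (a "File size of ... is N (M blocks of B bytes)" line followed by one line per extent with logical and physical block ranges), which may need adjusting elsewhere.

#!/usr/bin/env ruby
# Sketch only: collect per-file extent maps from filefrag -v output.

Extent   = Struct.new(:physical_start, :length)          # in filesystem blocks
FileFrag = Struct.new(:path, :size_bytes, :block_size, :extents)

# Parse the output of "filefrag -v file1 file2 ...". The regexps match the
# e2fsprogs output used here and may need tweaking for other versions.
def parse_filefrag(output)
  results = []
  current = nil
  output.each_line do |line|
    case line
    when /^File size of (.+) is (\d+) \((\d+) blocks? of (\d+) bytes\)/
      current = FileFrag.new($1, $2.to_i, $4.to_i, [])
      results << current
    when /^\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+):/
      current.extents << Extent.new($1.to_i, $3.to_i) if current
    end
  end
  results
end

# Periodic scan: walk the OSD data directory (the path is a placeholder)
# and feed batches of regular files to filefrag.
def scan(osd_dir, batch_size = 128)
  frags = []
  batch = []
  Dir.glob(File.join(osd_dir, "**", "*")).each do |path|
    next unless File.file?(path)
    batch << path
    if batch.size >= batch_size
      frags.concat(parse_filefrag(IO.popen(["filefrag", "-v", *batch], &:read)))
      batch.clear
    end
  end
  frags.concat(parse_filefrag(IO.popen(["filefrag", "-v", *batch], &:read))) unless batch.empty?
  frags
end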
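
And a sketch of the cost model itself, reusing the structs above. Again, this is an illustration of the idea rather than the production code: the seek time and sequential throughput are round numbers for a 7200rpm SATA drive, not measured values, and the exact way the ideal-case comparison and the size normalization are combined below is only one way of doing it.

# Sketch of the per-file fragmentation cost model (HDD-centric).

SEEK_TIME_S      = 0.008        # ~8 ms average seek on a 7200rpm SATA drive
SEQ_READ_BPS     = 100 * 2**20  # ~100 MB/s sequential throughput
NEAR_SEEK_BLOCKS = 2048         # gaps smaller than this count as a short seek
NEAR_SEEK_TIME_S = 0.001        # e.g. a track-to-track seek

# Estimated time to read the whole file as currently laid out: one initial
# seek, then for each extent a seek from the end of the previous extent
# (free when contiguous, cheap when close) plus its sequential read time.
def current_read_cost(frag)
  cost = SEEK_TIME_S
  previous_end = nil
  frag.extents.each do |ext|
    if previous_end
      gap = (ext.physical_start - previous_end).abs
      if gap.zero?
        # contiguous: Btrfs put this extent right after the previous one
      elsif gap <= NEAR_SEEK_BLOCKS
        cost += NEAR_SEEK_TIME_S
      else
        cost += SEEK_TIME_S
      end
    end
    cost += (ext.length * frag.block_size).to_f / SEQ_READ_BPS
    previous_end = ext.physical_start + ext.length
  end
  cost
end

# Ideal, fully defragmented case: a single seek plus one sequential read.
def ideal_read_cost(frag)
  SEEK_TIME_S + frag.size_bytes.to_f / SEQ_READ_BPS
end

# Ranking score: potential speedup from defragmenting, normalized by the
# file size so that the pain felt by small random reads (128kB/1MB RBD
# requests) is compared fairly across small and large files. Higher means
# defragment first.
def defrag_score(frag)
  return 0.0 if frag.extents.empty? || frag.size_bytes.zero?
  speedup = current_read_cost(frag) / ideal_read_cost(frag)
  speedup / frag.size_bytes
end

The main loop then essentially sorts the result of scan() by defrag_score and feeds the worst files to btrfs fi defrag, keeping the added load in mind as noted above.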