Hi list,

Excuse me, what I'm asking is a bit off topic.

@Lionel, since you use btrfs, have you already tried btrfs compression
for the OSDs? If so, can you share your experience?

2015-05-05 3:24 GMT+03:00 Lionel Bouton <lionel+ceph@xxxxxxxxxxx>:
> On 05/04/15 01:34, Sage Weil wrote:
>> On Mon, 4 May 2015, Lionel Bouton wrote:
>>> Hi,
>>>
>>> we began testing one Btrfs OSD volume last week and for this first test
>>> we disabled autodefrag and began to launch manual btrfs fi defrag.
>>> [...]
>> Cool.. let us know how things look after it ages!
>
> We had the first signs of Btrfs aging yesterday morning. Latencies went
> up noticeably. The journal was at ~3000 extents, back from a maximum of
> ~13000 the day before. To verify my assumption that journal
> fragmentation was not the cause of the latencies, I defragmented it. It
> took more than 7 minutes (10GB journal), left it at ~2300 extents
> (probably because it was heavily used during the defragmentation), and
> the high latencies weren't solved at all.
>
> The initial algorithm selected files to defragment based solely on the
> number of extents (files with more extents were processed first). This
> was a simple approach to the problem that I hoped would be enough, but
> it wasn't, so I had to make it more clever.
>
> filefrag -v conveniently outputs each fragment's relative position on
> the device and the total file size. So I changed the algorithm so that
> it still uses the result of a periodic find | xargs filefrag call
> (which is relatively cheap and ends up fitting in a <100MB Ruby
> process) but models the fragmentation cost better.
>
> The new algorithm computes the total cost of reading every file,
> counting an initial seek, the transfer time based on the sequential
> read speed, and the time associated with each seek from one extent to
> the next (which can be 0 when Btrfs managed to put an extent just after
> another, or very small if it is not far from the first on the same HDD
> track). This total cost is compared with the ideal defragmented case to
> estimate the speedup defragmentation could bring. Finally, the result
> is normalized by dividing it by the total size of each file. The
> normalization is done because in the case of RBD (and probably most
> other uses) what is interesting is how long a 128kB or 1MB read would
> take, whatever the file and the offset within it, not how long reading
> a whole file would take (there's an assumption that each file has the
> same probability of being read, which might need to be revisited).
> There are approximations in the cost computation and it's HDD-centric,
> but it's not very far from reality.
>
> The idea was that it would find the files where fragmentation hurts the
> most faster, instead of wasting time on less interesting files. This
> would make the defragmentation more efficient even if it didn't process
> as many files (the less defragmentation takes place, the less load we
> add).
>
> It has been working for the past day. Before the algorithm change, the
> Btrfs OSD disk was the slowest in the system, behind the three XFS ones
> by a large margin. This was confirmed both by iostat %util (often at
> 90-100%) and by monitoring the disk's average read/write latencies over
> time, which often spiked an order of magnitude above the other disks
> (as high as 3 seconds). Now the Btrfs OSD disk is at least comparable
> to the other disks, if not a bit faster (comparing latencies).
>
> It's still too early to tell, but this is very encouraging.
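
(Aside on the cost model described above: if I read it right, the
per-file scoring could be approximated with the Ruby sketch below. This
is only my reconstruction of what is described, not the actual tool;
the constants, the near-seek threshold, and the exact way the speedup
is combined with the size normalization are guesses.)

# Rough reconstruction of the scoring described above -- not the actual
# tool. Constants and the near-seek threshold are assumptions.
SEEK_COST      = 0.008      # average seek time in seconds (assumed)
TRACK_SEEK     = 0.0005     # short seek, next extent close by on disk (assumed)
READ_SPEED     = 100.0e6    # sequential read speed in bytes/s (assumed)
NEAR_THRESHOLD = 2048       # gap (in blocks) still counted as a short seek (assumed)

# extents: array of [physical_start, length] pairs in blocks,
# as parsed from the output of filefrag -v.
def read_cost(extents, file_size)
  cost = SEEK_COST + file_size / READ_SPEED      # initial seek + transfer
  extents.each_cons(2) do |(a_start, a_len), (b_start, _)|
    gap = b_start - (a_start + a_len)
    next if gap == 0                             # contiguous extents: no extra seek
    cost += gap.abs < NEAR_THRESHOLD ? TRACK_SEEK : SEEK_COST
  end
  cost
end

# Ideal (defragmented) case: one seek, then one sequential read.
def ideal_cost(file_size)
  SEEK_COST + file_size / READ_SPEED
end

# Potential speedup, normalized by file size so the score reflects how
# costly a small read anywhere in this file is, not how costly reading
# the whole file is. Ratio vs. difference here is my guess.
def defrag_score(extents, file_size)
  (read_cost(extents, file_size) / ideal_cost(file_size)) / file_size.to_f
end

(A crawler would then feed the highest-scoring files to btrfs fi defrag
first, which matches the goal of spending the defragmentation effort
where it helps most.)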
>
> Best regards,
>
> Lionel
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Have a nice day,
Timofey.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com