On 05/05/15 02:24, Lionel Bouton wrote:
> On 05/04/15 01:34, Sage Weil wrote:
>> On Mon, 4 May 2015, Lionel Bouton wrote:
>>> Hi,
>>>
>>> we began testing one Btrfs OSD volume last week and for this first test
>>> we disabled autodefrag and began to launch manual btrfs fi defrag.
>>> [...]
>> Cool.. let us know how things look after it ages!
> [...]
>
> It worked for the past day. Before the algorithm change the Btrfs OSD
> disk was the slowest on the system compared to the three XFS ones by a
> large margin. This was confirmed both by iostat %util (often at 90-100%)
> and monitoring the disk average read/write latencies over time which
> often spiked one order of magnitude above the other disks (as high as 3
> seconds). Now the Btrfs OSD disk is at least comparable to the other
> disks if not a bit faster (comparing latencies).
>
> This is still too early to tell, but very encouraging.

Still going well: I added two new OSDs which are behaving correctly too.
The first of the two has finished catching up.

There's a big difference in the number of extents on XFS and on Btrfs.
I've seen files backing rbd (4MB files with rbd in their names) often
have only 1 or 2 extents on XFS. On Btrfs they seem to start at 32
extents when they are created, and Btrfs doesn't seem to mind (i.e.
calling btrfs fi defrag <file> doesn't reduce the number of extents, at
least not in the following 30s during which the count should normally go
down). The extents aren't far from each other on disk though, at least
initially.

My simple algorithm computes a fragmentation cost for each file: the
expected overhead of reading the file compared to a fully optimized
version of it. According to it, just after an OSD finishes catching up
(between 3 hours and 1 day depending on the cluster load and settings),
the content is already heavily fragmented: files are expected to take
more than 6x the read time their optimized versions would. Then my
defragmentation scheduler manages to bring the maximum fragmentation
cost (according to its own definition) down to roughly 0.66 of that
value: the very first OSD volume is currently sitting at a ~4x cost and
occasionally reaches the 3.25-3.5 range. A simplified sketch of this
cost computation is included below.

Is there something that would explain why Btrfs initially creates the
4MB files with 128k extents (32 extents per file)? Is it a bad thing for
performance?

During normal operation the Btrfs OSD volumes continue to behave the
same way the XFS ones do on the same system (sometimes faster, sometimes
slower). What is really slow though is the OSD process startup. I've yet
to run serious tests (unmounting the filesystems to clear the caches),
but I've already seen 3 minutes of delay reading the pgs.
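Coming back to the fragmentation cost mentioned above, here is a
simplified sketch of the idea. This is not the actual scheduler: it
enumerates extents with "filefrag -v" (e2fsprogs), and the seek time and
throughput constants are made-up values for an ordinary spinning drive,
so treat it only as an illustration of the shape of the computation.

#!/usr/bin/env python
# Simplified sketch of a per-file fragmentation cost estimate.
# NOT the actual defragmentation scheduler: the extent list comes from
# "filefrag -v" (e2fsprogs) and the seek time / throughput constants
# below are illustrative assumptions, not measured values.

import os
import re
import subprocess
import sys

SEEK_TIME = 0.008            # assumed average seek time (seconds)
THROUGHPUT = 100 * 1024**2   # assumed sequential throughput (bytes/s)

# matches filefrag -v data lines such as:
#    0:        0..     255:      34816..     35071:    256:
EXTENT_RE = re.compile(r"\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*\d+:\s*(\d+):")

def extents(path):
    """Return a list of (physical_start, length) tuples in filesystem blocks."""
    out = subprocess.check_output(["filefrag", "-v", path]).decode()
    result = []
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            result.append((int(m.group(1)), int(m.group(2))))
    return result

def fragmentation_cost(path):
    """Estimated read time of the file divided by the read time of a
    contiguous (optimized) copy of the same data."""
    exts = extents(path)
    transfer = os.path.getsize(path) / float(THROUGHPUT)
    # count a seek for every extent that doesn't start right where the
    # previous one ended
    seeks = 0
    prev_end = None
    for start, length in exts:
        if prev_end is None or start != prev_end:
            seeks += 1
        prev_end = start + length
    ideal = SEEK_TIME + transfer    # one seek, then a sequential read
    actual = seeks * SEEK_TIME + transfer
    return actual / ideal

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print("%s: %d extents, cost ~%.2fx" %
              (f, len(extents(f)), fragmentation_cost(f)))

Run over the 4MB rbd data files this only gives a rough picture: block
sizes and drive characteristics differ, so the numbers are meant to show
the method, not to be taken as absolute measurements.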
Example from the OSD log (note the 3 minutes between load_pgs starting
and "opened 160 pgs"):

2015-05-05 16:01:24.854504 7f57c518b780 0 osd.17 22428 load_pgs
2015-05-05 16:01:24.936111 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.936137 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671188' got (2) No such file or directory
2015-05-05 16:01:24.991629 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.991654 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671189' got (2) No such file or directory
2015-05-05 16:04:25.413110 7f57c518b780 0 osd.17 22428 load_pgs opened 160 pgs

The filesystem might not have reached its balance between fragmentation
and defragmentation rate yet (so this may still change), but it mirrors
our initial experience with Btrfs, where this was the first symptom of
bad performance.

Best regards,

Lionel