On 05/06/2015 12:51 PM, Lionel Bouton wrote:
On 05/05/15 02:24, Lionel Bouton wrote:
On 05/04/15 01:34, Sage Weil wrote:
On Mon, 4 May 2015, Lionel Bouton wrote:
Hi, we began testing one Btrfs OSD volume last week and for this first test we disabled autodefrag and began to launch manual btrfs fi defrag. [...]

Cool.. let us know how things look after it ages!

[...] It worked for the past day. Before the algorithm change the Btrfs OSD disk was the slowest on the system compared to the three XFS ones by a large margin. This was confirmed both by iostat %util (often at 90-100%) and by monitoring the disk average read/write latencies over time, which often spiked one order of magnitude above the other disks (as high as 3 seconds). Now the Btrfs OSD disk is at least comparable to the other disks if not a bit faster (comparing latencies). It is still too early to tell, but very encouraging.

Still going well: I added two new OSDs which are behaving correctly too. The first of the two has finished catching up.

There's a big difference in the number of extents on XFS and on Btrfs. I've seen files backing rbd (4MB files with rbd in their names) often have only 1 or 2 extents on XFS. On Btrfs they seem to start at 32 extents when they are created and Btrfs doesn't seem to mind (ie: calling btrfs fi defrag <file> doesn't reduce the number of extents, at least not in the following 30s where it should go down). The extents aren't far from each other on disk though, at least initially.

When my simple algorithm computes the fragmentation cost (the expected overhead of reading a file vs its optimized version), it seems that just after catching up (which takes between 3 hours and 1 day depending on the cluster load and settings), the content is already heavily fragmented (reading a file is expected to take more than 6x the time it would for an optimized version). Then my defragmentation scheduler manages to bring the maximum fragmentation cost (according to its own definition) down to roughly 0.66 of that value (the very first OSD volume is currently sitting at a ~4x cost and occasionally reaches the 3.25-3.5 range).

Is there something that would explain why Btrfs initially creates the 4MB files with 128k extents (32 extents / file)? Is it a bad thing for performance?

During normal operation Btrfs OSD volumes continue to behave in the same way XFS ones do on the same system (sometimes faster/sometimes slower). What is really slow though is the OSD process startup. I have yet to make serious tests (unmounting the filesystems to clear caches), but I've already seen 3 minutes of delay reading the pgs.
Example:

2015-05-05 16:01:24.854504 7f57c518b780 0 osd.17 22428 load_pgs
2015-05-05 16:01:24.936111 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.936137 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671188' got (2) No such file or directory
2015-05-05 16:01:24.991629 7f57ae7fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.991654 7f57ae7fc700 -1 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap 'snap_1671189' got (2) No such file or directory
2015-05-05 16:04:25.413110 7f57c518b780 0 osd.17 22428 load_pgs opened 160 pgs

The filesystem might not have reached its balance between fragmentation and defragmentation rate at this time (so this may change), but this slow startup mirrors our initial experience with Btrfs, where it was the first symptom of bad performance.
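For illustration, here is a minimal sketch of the kind of per-file fragmentation cost estimate described above: the expected sequential read time of the file as laid out on disk, divided by the read time of an ideally contiguous copy, derived from filefrag -v output with a simple seek-plus-throughput disk model. The seek time, throughput and filefrag parsing below are my own assumptions, not the actual scheduler code:

#!/usr/bin/env python3
# Rough sketch of a per-file "fragmentation cost": estimated sequential read
# time of the file as laid out on disk divided by the read time of an ideally
# contiguous copy. Disk model constants and filefrag parsing are assumptions;
# filefrag -v output format also varies slightly between e2fsprogs versions.

import re
import subprocess
import sys

SEEK_MS = 8.0          # assumed average seek + rotational latency (ms)
THROUGHPUT = 120e6     # assumed sequential throughput (bytes/s)
BLOCK = 4096           # ask filefrag to report offsets in 4 KiB blocks

EXTENT_RE = re.compile(
    r'^\s*\d+:\s+(\d+)\.\.\s*(\d+):\s+(\d+)\.\.\s*(\d+):\s+(\d+):')

def extents(path):
    """Return (physical_start, length_in_blocks) pairs from filefrag -v."""
    out = subprocess.check_output(
        ['filefrag', '-v', '-b%d' % BLOCK, path]).decode()
    result = []
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            result.append((int(m.group(3)), int(m.group(5))))
    return result

def read_time_ms(ext_list):
    """Estimated time to read the extents in logical order."""
    total_bytes = sum(length for _, length in ext_list) * BLOCK
    seeks = 1                      # initial seek to the first extent
    prev_end = None
    for phys_start, length in ext_list:
        if prev_end is not None and phys_start != prev_end:
            seeks += 1             # discontiguity: charge one more seek
        prev_end = phys_start + length
    return seeks * SEEK_MS + total_bytes / THROUGHPUT * 1000.0

def fragmentation_cost(path):
    ext_list = extents(path)
    if not ext_list:
        return 1.0
    total_bytes = sum(length for _, length in ext_list) * BLOCK
    ideal_ms = SEEK_MS + total_bytes / THROUGHPUT * 1000.0
    return read_time_ms(ext_list) / ideal_ms

if __name__ == '__main__':
    for path in sys.argv[1:]:
        print('%6.2fx  %s' % (fragmentation_cost(path), path))

Note that this crude model charges a full seek for every discontiguity, so nearly-adjacent extents (like the initial 32 x 128k layout described above) get penalized more than they probably deserve; a more realistic cost function would presumably weight short jumps less.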
Out of curiosity, do you see excessive memory usage during defragmentation? Last time I spoke to Josef it sounded like it wasn't particularly safe yet and could make the machine go OOM, especially if there are lots of snapshots.
I've also included some test results from Emperor (i.e. quite old now) showcasing how sequential read performance degrades on btrfs after random writes are performed (on the 2nd tab you can see how even writes are affected as well). Basically the first iteration of tests looks great up until random writes are done, which cause excessive fragmentation due to COW; subsequent tests are then quite bad compared to the initial BTRFS tests (and to XFS).
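For anyone wanting to reproduce that effect on a small scale, below is a minimal sketch (not the harness behind the attached spreadsheet) that writes a file sequentially on a btrfs mount, overwrites random 4 KiB blocks in place so COW creates new extents, and compares cold-cache sequential read time and extent count before and after. The path, file size, overwrite ratio and use of posix_fadvise to drop the page cache are assumptions:

#!/usr/bin/env python3
# Rough sketch: sequential write, then random in-place overwrites (which btrfs
# turns into new COW extents), comparing extent count and cold-cache
# sequential read time before and after. Path, sizes and overwrite ratio are
# assumptions; run it on a btrfs mount with enough free space.

import os
import random
import subprocess
import time

PATH = '/var/lib/test/cow_frag_test.dat'   # hypothetical path on a btrfs mount
SIZE = 256 * 1024 * 1024                   # 256 MiB test file
BLOCK = 4096

def seq_read_seconds(path):
    """Drop the file's cached (clean) pages, then time a sequential read."""
    with open(path, 'rb') as f:
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.time()
        while f.read(1024 * 1024):
            pass
        return time.time() - start

def extent_count(path):
    """Parse 'path: N extents found' from plain filefrag output."""
    out = subprocess.check_output(['filefrag', path]).decode()
    return int(out.split(':')[-1].split()[0])

# 1. Sequential write (incompressible data, so compression doesn't skew it).
with open(PATH, 'wb') as f:
    for _ in range(SIZE // BLOCK):
        f.write(os.urandom(BLOCK))
    f.flush()
    os.fsync(f.fileno())

print('after sequential write:  %d extents, read %.2fs'
      % (extent_count(PATH), seq_read_seconds(PATH)))

# 2. Random in-place overwrites -> COW scatters the file into small extents.
with open(PATH, 'r+b') as f:
    for _ in range(SIZE // BLOCK // 4):        # rewrite ~25% of the blocks
        f.seek(random.randrange(SIZE // BLOCK) * BLOCK)
        f.write(os.urandom(BLOCK))
    f.flush()
    os.fsync(f.fileno())

print('after random overwrites: %d extents, read %.2fs'
      % (extent_count(PATH), seq_read_seconds(PATH)))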
Your testing is thus quite interesting, especially if it means we can reduce this effect. Keep it up!
Mark
Best regards,
Lionel
Attachment:
Emeror Raw Performance Data.ods
Description: application/vnd.oasis.opendocument.spreadsheet