On Mon, 4 May 2015, Lionel Bouton wrote:
> Hi,
>
> we began testing one Btrfs OSD volume last week, and for this first
> test we disabled autodefrag and began to launch manual btrfs fi defrag.
>
> During the tests, I monitored the number of extents of the journal
> (10GB) and it went through the roof (it currently sits at 8000+
> extents, for example). I was tempted to defragment it, but after
> thinking a bit about it I think it might not be a good idea.
>
> With Btrfs, by default the data written to the journal on disk isn't
> copied to its final destination. Ceph is using a clone_range feature
> to reference the same data instead of copying it.

We've discussed this possibility but have never implemented it. The
data is written twice: once to the journal and once to the object file.

> So if you defragment both the journal and the final destination, you
> are moving the data around to attempt to get both references to
> satisfy a one-extent goal, but most of the time you can't get both of
> them at the same time (unless the destination is a whole file instead
> of a fragment of one).
>
> I assume the journal probably doesn't benefit at all from
> defragmentation: it's overwritten constantly, and as Btrfs uses CoW,
> the previous extents won't be reused at all; new ones will be created
> for the new data instead of overwriting the old in place. The final
> destination files are reused (reread) and do benefit from
> defragmentation.

Yeah, I agree. It is probably best to let btrfs write the journal
anywhere, since it is never read (except for replay after a failure or
restart).

There is also a newish 'journal discard' option that is false by
default; enabling it may let us throw out the previously allocated
space so that new writes go to fresh locations (instead of to the
previously written and fragmented positions). I expect this will make
a positive difference, but I'm not sure that anyone has tested it. (A
sketch of the relevant ceph.conf fragment is at the end of this
message.)

> Under these assumptions we excluded the journal file from
> defragmentation; in fact we only defragment the "current" directory
> (snapshot directories are probably only read from in rare cases and
> are ephemeral, so optimizing them is not interesting).
>
> The filesystem is only one week old, so we will have to wait a bit to
> see if this strategy is better than the one used when mounting with
> autodefrag (I couldn't find much about it, but last year we had
> unmanageable latencies).

Cool.. let us know how things look after it ages!

sage

> We have a small Ruby script which triggers defragmentation based on
> the number of extents and by default limits the rate of calls to
> btrfs fi defrag to a negligible level to avoid thrashing the
> filesystem. If someone is interested I can attach it or push it to
> GitHub after a bit of cleanup.
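
To make the approach concrete, here is a minimal, untested sketch of
what such a script might look like. This is an illustration, not the
script mentioned above: the OSD path, the extent threshold, and the
pause length are all hypothetical, and it assumes filefrag (from
e2fsprogs) and btrfs-progs are installed. Like the setup described
above, it only walks the "current" directory, so the journal is never
touched:

  #!/usr/bin/env ruby
  # Illustrative sketch only -- not the script mentioned above. Assumes
  # filefrag (e2fsprogs) and btrfs-progs are available and that it runs
  # with enough privilege to defragment the OSD's files.
  require 'find'
  require 'shellwords'

  OSD_CURRENT  = '/var/lib/ceph/osd/ceph-0/current'  # hypothetical path
  MAX_EXTENTS  = 32   # defragment files fragmented beyond this
  DEFRAG_PAUSE = 5    # seconds between defrag calls (rate limit)

  # Parse "... N extents found" from filefrag's output; a failed run
  # yields no match, hence 0 extents, hence the file is skipped.
  def extent_count(path)
    out = `filefrag #{Shellwords.escape(path)} 2>/dev/null`
    out[/(\d+) extents? found/, 1].to_i
  end

  # Only walk "current"; the journal lives elsewhere and is skipped.
  Find.find(OSD_CURRENT) do |path|
    next unless File.file?(path)
    next if extent_count(path) <= MAX_EXTENTS
    system('btrfs', 'filesystem', 'defragment', path)
    sleep DEFRAG_PAUSE  # keep the defrag rate negligible
  end

The sleep after each defrag call is the crude rate limiter: it bounds
the defragmentation load regardless of how many files are fragmented.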
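
And regarding the 'journal discard' option mentioned above, enabling
it should be a matter of something like the following in ceph.conf (a
sketch, assuming the option is set OSD-wide in the [osd] section and
the OSD is restarted afterwards; as noted, its effect on fragmentation
is untested):

  [osd]
      ; untested: discard journal space after it is consumed, so new
      ; journal writes can land in freshly allocated extents
      journal discard = true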