On 04/15/2015 10:36 AM, Jake Young wrote:
> On Wednesday, April 15, 2015, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> On 04/15/2015 08:16 AM, Jake Young wrote:
>>>
>>> Has anyone compiled ceph (either osd or client) on a Solaris based OS?
>>>
>>> The thread on ZFS support for osd got me thinking about using Solaris
>>> as an osd server. It would have much better ZFS performance, and I
>>> wonder if the osd performance without a journal would be 2x better.
>>
>> Doubt it. You may be able to do a little better, but you have to pay
>> the piper somehow. If you clone from the journal you will introduce
>> fragmentation. If you throw the journal away you'll suffer for
>> everything but very large writes, unless you throw safety away. I
>> think if we are going to generally beat filestore (not just in optimal
>> benchmarking tests!) it's going to take some very careful cleverness.
>> Thankfully Sage is very clever and is working on it in newstore. Even
>> there, filestore has been proving difficult to beat for writes.
>
> That's interesting. I've been under the impression that the ideal osd
> config was using a stable and fast BTRFS (which doesn't exist yet) with
> no journal.
This is sort of unrelated to the journal specifically, but BTRFS with RBD will start fragmenting terribly due to how COW works (and how it relates to snapshots too). More related to the journal: at one point we were thinking about cloning from the journal on BTRFS, but that also potentially leads to nasty fragmentation, even if the initial behavior looks very good. I can't remember doing any testing of BTRFS with no journal; I'm not sure it even still works...
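If you want to spot-check this effect on your own OSDs, something like the following will show the extent counts for a sample of object files (the data path here is just the common default; adjust it for your cluster):

    # Illustrative only: count extents for a sample of filestore object files.
    # Lots of extents per ~4MB object means heavy fragmentation.
    find /var/lib/ceph/osd/ceph-0/current -type f -name '*__head_*' \
        | head -n 20 | xargs -r filefrag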
> In my specific case, I don't want to use an external journal. I've gone
> down the path of using RAID controllers with write-back cache and BBUs,
> with each disk in its own RAID0 group, instead of SSD journals. (Thanks
> for your performance articles BTW, they were very helpful!)
>
> My take on your results is that IO throughput on XFS with a same-disk
> journal and WB cache on the RAID card was basically the same as or
> better than BTRFS with no journal. In addition, BTRFS typically used
> much more CPU. Has BTRFS performance gotten any better since you wrote
> the performance articles?
So the trick with those articles is that the systems are fresh, and most of the initial articles used rados bench, which is always writing out new objects, vs something like RBD where you are (usually) writing to existing objects that represent the blocks. If you were to do a bunch of random 4k writes and then later try to do sequential reads, you'd see BTRFS sequential read performance tank. We actually did tests like that with Emperor during the Firefly development cycle; I've included the results. Basically the first iteration of the test cycle looks great on BTRFS, then you see read performance drop way down. Eventually write performance is also likely to drop as the disks become extremely fragmented (we may even see a little of that in those tests).
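If you want to reproduce that pattern yourself, something along these lines should do it (pool and image names are just placeholders, and the rbd ioengine needs a reasonably recent fio):

    # Illustrative only: fragment an RBD image with small random writes,
    # then measure sequential read throughput from the same image.
    rbd create -p rbd --size 10240 fragtest    # 10 GB test image
    fio --name=fill --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fragtest --rw=randwrite --bs=4k --iodepth=32
    fio --name=seqread --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fragtest --rw=read --bs=4M --iodepth=16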
> Have you compared ZFS (ZoL) performance to BTRFS?
I did, way back in 2013, when we were working with Brian Behlendorf to fix xattr bugs in ZoL. It was quite a bit slower if you didn't enable SA xattrs. With SA xattrs it was much closer, but not as fast as BTRFS or XFS. I didn't do a lot of tuning though, and Ceph wasn't making good use of ZFS features, so it's very possible things have changed.
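For reference, SA xattrs are just a per-dataset ZFS property, and they only apply to newly created files, so it needs to be set before the OSD stores anything. Something like (dataset name is just an example):

    # Illustrative only: store xattrs as system attributes in the dnode
    # instead of a hidden xattr directory (much cheaper for Ceph's
    # frequent small xattrs). Dataset name is hypothetical.
    zfs set xattr=sa tank/ceph-osd0
    zfs get xattr tank/ceph-osd0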
Attachment: Emeror Raw Performance Data.ods (application/vnd.oasis.opendocument.spreadsheet)