On 04/15/2015 10:36 AM, Jake Young wrote:
> On Wednesday, April 15, 2015, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> On 04/15/2015 08:16 AM, Jake Young wrote:
>>>
>>> Has anyone compiled ceph (either osd or client) on a Solaris based OS?
>>>
>>> The thread on ZFS support for osd got me thinking about using Solaris
>>> as an osd server. It would have much better ZFS performance, and I
>>> wonder if the osd performance without a journal would be 2x better.
>>
>> Doubt it. You may be able to do a little better, but you have to pay
>> the piper somehow. If you clone from the journal you will introduce
>> fragmentation. If you throw the journal away you'll suffer for
>> everything but very large writes, unless you throw safety away. I
>> think if we are going to generally beat filestore (not just in optimal
>> benchmarking tests!) it's going to take some very careful cleverness.
>> Thankfully Sage is very clever and is working on it in newstore. Even
>> there, filestore has been proving difficult to beat for writes.
>
> That's interesting. I've been under the impression that the ideal osd
> config was using a stable and fast BTRFS (which doesn't exist yet) with
> no journal.
This is sort of unrelated to the journal specifically, but BTRFS with RBD will start fragmenting terribly due to how COW works (and how it relates to snapshots too). More related to the journal: at one point we were thinking about cloning from the journal on BTRFS, but that also potentially leads to nasty fragmentation, even if the initial behavior looks very good. I can't remember doing any testing of BTRFS with no journal; I'm not sure it even still works...
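If you want to spot-check this effect on your own OSDs, something like the following will show the extent counts for a sample of object files (the data path here is just the common default; adjust it for your cluster):

    # Illustrative only: count extents for a sample of filestore object files.
    # Lots of extents per ~4MB object means heavy fragmentation.
    find /var/lib/ceph/osd/ceph-0/current -type f -name '*__head_*' \
        | head -n 20 | xargs -r filefrag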
> In my specific case, I don't want to use an external journal. I've gone
> down the path of using RAID controllers with write-back cache and BBUs,
> with each disk in its own RAID0 group, instead of SSD journals. (Thanks
> for your performance articles BTW, they were very helpful!)
>
> My take on your results is that IO throughput on XFS with a same-disk
> journal and WB cache on the RAID card was basically the same as or
> better than BTRFS with no journal. In addition, BTRFS typically used
> much more CPU. Has BTRFS performance gotten any better since you wrote
> the performance articles?
So the trick with those articles is that the systems are fresh, and most of the initial articles used rados bench, which is always writing out new objects, vs something like RBD where you are (usually) writing to existing objects that represent the blocks. If you were to do a bunch of random 4k writes and then later try to do sequential reads, you'd see BTRFS sequential read performance tank. We actually did tests like that with Emperor during the Firefly development cycle; I've included the results. Basically the first iteration of the test cycle looks great on BTRFS, then you see read performance drop way down. Eventually write performance is also likely to drop as the disks become extremely fragmented (we may even see a little of that in those tests).
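If you want to reproduce that pattern yourself, something along these lines should do it (pool and image names are just placeholders, and the rbd ioengine needs a reasonably recent fio):

    # Illustrative only: fragment an RBD image with small random writes,
    # then measure sequential read throughput from the same image.
    rbd create -p rbd --size 10240 fragtest    # 10 GB test image
    fio --name=fill --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fragtest --rw=randwrite --bs=4k --iodepth=32
    fio --name=seqread --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fragtest --rw=read --bs=4M --iodepth=16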
> Have you compared ZFS (ZoL) performance to BTRFS?
I did, way back in 2013, when we were working with Brian Behlendorf to fix xattr bugs in ZoL. It was quite a bit slower if you didn't enable SA xattrs. With SA xattrs it was much closer, but not as fast as BTRFS or XFS. I didn't do a lot of tuning though, and Ceph wasn't making good use of ZFS features, so it's very possible things have changed.
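For reference, SA xattrs are just a per-dataset ZFS property, and they only apply to newly created files, so it needs to be set before the OSD stores anything. Something like (dataset name is just an example):

    # Illustrative only: store xattrs as system attributes in the dnode
    # instead of a hidden xattr directory (much cheaper for Ceph's
    # frequent small xattrs). Dataset name is hypothetical.
    zfs set xattr=sa tank/ceph-osd0
    zfs get xattr tank/ceph-osd0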
Attachment: Emeror Raw Performance Data.ods (application/vnd.oasis.opendocument.spreadsheet)