Re: Ceph on Solaris / Illumos

Michal Kozanecki <mkozanecki@xxxxxxxxxx> · Fri, 17 Apr 2015 16:05:08 +0000

Performance on ZFS on Linux (ZoL) seems to be fine, as long as you use the generic Ceph filesystem implementation (writeahead) and not the Ceph-specific ZFS implementation. The CoW snapshotting that Ceph does with ZFS support compiled in absolutely kills performance. I suspect the same would go for Ceph on Illumos on ZFS. Otherwise, once tweaked, it is comparable to XFS in my own testing.

There are a few oddities/quirks with ZFS performance that need to be tweaked when using it with Ceph, and yes, enabling SA on xattr is one of them.

1. ZFS recordsize - The ZFS "sector size", known within ZFS as the recordsize, is technically dynamic: it only enforces the maximum size. However, the way Ceph writes to and reads from objects (when working with smaller blocks, say 4K or 8K via rbd) with default settings seems to be affected by the recordsize. With the default 128K I've found lower IOPS and higher latency. Setting the recordsize too low will inflate various ZFS metadata, so it needs to be balanced against how your Ceph pool will be used.

For rbd pools (where small-block performance may be important) a recordsize of 32K seems to be a good balance. For pure large-object use (rados, etc.) the 128K default is fine: throughput is high, and small-block performance isn't important here. See the following links for more info about recordsize: https://blogs.oracle.com/roch/entry/tuning_zfs_recordsize and https://www.joyent.com/blog/bruning-questions-zfs-record-size
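
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the worst case, where a small overwrite inside an existing object forces ZFS to rewrite the whole record that contains it. Real behaviour depends on caching, compression and how Ceph lays out objects, so treat the numbers as an upper bound under that assumption, not as a measurement. The function name is made up for illustration.

```python
KIB = 1024


def worst_case_write_amplification(recordsize_bytes, io_size_bytes):
    """Bytes rewritten per byte the client wrote.

    Simplifying assumption: every small overwrite dirties exactly one
    full record, and writes at least as large as a record are aligned.
    """
    if io_size_bytes >= recordsize_bytes:
        return 1.0
    return recordsize_bytes / io_size_bytes


# Compare the 128K default with the 32K value suggested for rbd pools.
for recordsize in (128 * KIB, 32 * KIB):
    for io_size in (4 * KIB, 8 * KIB):
        amp = worst_case_write_amplification(recordsize, io_size)
        print(f"recordsize={recordsize // KIB}K io={io_size // KIB}K "
              f"-> up to {amp:.0f}x")
```

Under that assumption a 4K write costs up to 32x at 128K but only up to 8x at 32K, which is at least consistent with the lower IOPS seen at the default.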

2. XATTR - I didn't do much testing here, but I've read that if you do not set xattr = sa on ZFS you will get poor performance. There were also stability issues in the past with xattr = sa on ZFS, though they all seem to be resolved now and I have not encountered any issues myself. I'm unsure what the default setting is here; I always enable it.

Make sure you enable and set xattr = sa on ZFS.
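
As a minimal sketch of what setting these properties looks like, the Python snippet below only builds and prints the `zfs set` / `zfs get` command lines; it does not run them. The dataset name tank/ceph-osd0 and the helper name are hypothetical, and the 32K value is simply the rbd-oriented suggestion from item 1.

```python
import shlex


def zfs_tuning_commands(dataset, recordsize="32K", xattr="sa"):
    """Return the zfs invocations for one dataset as argv lists."""
    return [
        ["zfs", "set", f"recordsize={recordsize}", dataset],
        ["zfs", "set", f"xattr={xattr}", dataset],
        # Read the values back to confirm what is actually in effect.
        ["zfs", "get", "recordsize,xattr", dataset],
    ]


# "tank/ceph-osd0" is a placeholder; substitute your own OSD dataset.
for argv in zfs_tuning_commands("tank/ceph-osd0"):
    print(shlex.join(argv))
```

Both properties only apply to data written after the change, so it is simplest to set them before the OSD is populated.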

3. ZIL (ZFS Intent Log) on a separate log device, a.k.a. the SLOG, is a MUST (even with a separate Ceph journal) - It appears that while the Ceph journal offloads/absorbs writes nicely and boosts performance, it does not consolidate writes enough for ZFS. Without a ZIL/SLOG your performance will be very sawtooth-like (jumpy, stuttering: fast then slow, fast then slow over a period of 10-15 seconds).

In theory, tweaking the various ZFS TXG sync settings might work, but it is overly complicated to maintain and would likely only apply to the specific underlying disk model. Disabling sync also resolves this, though you'll lose the last TXG on a power failure. This might be okay with Ceph, but since I'm unsure I'll just assume it is not. IMHO avoid too much evil tuning and just add a ZIL/SLOG.

4. ZIL/SLOG + on-device Ceph journal vs ZIL/SLOG + separate Ceph journal - Performance is very similar. If you have a ZIL/SLOG you could easily get away without a separate Ceph journal and leave it on the device/ZFS dataset. HOWEVER, this causes HUGE amounts of fragmentation due to the CoW nature of ZFS. After only a few days of usage, performance tanked with the Ceph journal on the same device.

I did find that if you partition a device/SSD and share it between the ZIL/SLOG and a separate Ceph journal, the resulting performance is about the same in pure throughput/IOPS, though latency is slightly higher. This is what I do in my test cluster.
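
A shared-SSD layout like this could be sketched as follows. This is an illustration under assumptions, not the exact setup described above: the pool name, OSD id and partition paths are placeholders, and the snippet only prints a `zpool add ... log` command and a ceph.conf fragment using the `osd journal` option, without touching any device.

```python
import shlex


def shared_ssd_plan(pool, osd_id, slog_part, journal_part):
    """Build the SLOG command and journal config for one partitioned SSD."""
    # Attach the first SSD partition to the pool as a separate log device.
    zpool_cmd = ["zpool", "add", pool, "log", slog_part]
    # Point the OSD's journal at the second partition of the same SSD.
    ceph_conf = f"[osd.{osd_id}]\n    osd journal = {journal_part}\n"
    return zpool_cmd, ceph_conf


# All names below are hypothetical placeholders.
cmd, conf = shared_ssd_plan(
    pool="tank",
    osd_id=0,
    slog_part="/dev/disk/by-id/example-ssd-part1",
    journal_part="/dev/disk/by-id/example-ssd-part2",
)
print(shlex.join(cmd))
print(conf)
```

Keeping the journal on its own partition, rather than on the ZFS dataset, is what avoids the CoW fragmentation described in item 4.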

5. Fragmentation - Once you hit around 80-90% disk usage your performance will start to slow down due to fragmentation. This isn't due to Ceph; it's a known ZFS quirk due to its CoW nature. Unfortunately there is no defrag in ZFS, and there likely never will be (the mythical block pointer rewrite unicorn you'll find people talking about).

There is one way to delay it and possibly avoid it, however: enable metaslab_debug. This keeps the ZFS spacemaps in memory, allowing ZFS to make better placement decisions during CoW operations, but it does use more memory. See the following links for more detail about spacemaps and fragmentation: http://blog.delphix.com/uday/2013/02/19/78/ and http://serverfault.com/a/556892 and http://www.mail-archive.com/zfs-discuss@xxxxxxxxxxxxxxx/msg45408.html
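
For ZFS on Linux, a quick way to see how this tunable is currently set is to read the module parameters under /sys/module/zfs/parameters. The sketch below assumes the tunable is exposed there as metaslab_debug on older releases and as metaslab_debug_load / metaslab_debug_unload on newer ones; check the names against your own build, and note that Illumos exposes its tunables differently. The helper name is made up for illustration.

```python
from pathlib import Path

# Assumed location of ZFS on Linux module parameters.
PARAM_DIR = Path("/sys/module/zfs/parameters")
# Assumed parameter names; which ones exist depends on the ZFS version.
CANDIDATES = ("metaslab_debug_load", "metaslab_debug_unload", "metaslab_debug")


def read_metaslab_debug(param_dir=PARAM_DIR):
    """Return {name: value} for whichever candidate parameters exist."""
    found = {}
    for name in CANDIDATES:
        path = Path(param_dir) / name
        if path.is_file():
            found[name] = path.read_text().strip()
    return found


print(read_metaslab_debug() or "no metaslab_debug* parameter found on this host")
```

A value of 1 means the behaviour is enabled. Module parameters are normally made persistent with an `options zfs ...` line in a file under /etc/modprobe.d/.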

There's a lot more to ZFS and its "things-to-know" than that (L2ARC uses ARC metadata space, dedupe uses ARC metadata space, etc.), but as far as Ceph is concerned the above is a good place to start. ZFS IMHO is a great solution, but it requires some time and effort to do it right.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozanecki@xxxxxxxxxx

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: April-15-15 12:22 PM
To: Jake Young
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Ceph on Solaris / Illumos

On 04/15/2015 10:36 AM, Jake Young wrote:
>
>
> On Wednesday, April 15, 2015, Mark Nelson <mnelson@xxxxxxxxxx 
> <mailto:mnelson@xxxxxxxxxx>> wrote:
>
>
>
>     On 04/15/2015 08:16 AM, Jake Young wrote:
>
>         Has anyone compiled ceph (either osd or client) on a Solaris
>         based OS?
>
>         The thread on ZFS support for osd got me thinking about using
>         solaris as
>         an osd server. It would have much better ZFS performance and I
>         wonder if
>         the osd performance without a journal would be 2x better.
>
>
>     Doubt it.  You may be able to do a little better, but you have to
>     pay the piper some how.  If you clone from journal you will
>     introduce fragmentation.  If you throw the journal away you'll
>     suffer for everything but very large writes unless you throw safety
>     away.  I think if we are going to generally beat filestore (not just
>     for optimal benchmarking tests!) it's going to take some very
>     careful cleverness. Thankfully Sage is very clever and is working on
>     it in newstore. Even there, filestore has been proving difficult to
>     beat for writes.
>
>
> That's interesting. I've been under the impression that the ideal osd 
> config was using a stable and fast BTRFS (which doesn't exist
> yet) with no journal.

This is sort of unrelated to the journal specifically, but BTRFS with RBD will start fragmenting terribly due to how COW works (and how it relates to snapshots too).  More related to the journal:  At one point we were thinking about cloning from the journal on BTRFS, but that also potentially leads to nasty fragmentation even if the initial behavior would look very good.  I haven't done any testing that I can remember of BTRFS with no journal.  I'm not sure if it even still works...

>
> In my specific case, I don't want to use an external journal. I've 
> gone down the path of using RAID controllers with write-back cache and 
> BBUs with each disk in its own RAID0 group, instead of SSD journals. 
> (Thanks for your performance articles BTW, they were very helpful!)
>
> My take on your results indicates that IO throughput performance on 
> XFS with same disk journal and WB cache on the RAID card was basically 
> the same or better than BTRFS with no journal.  In addition, BTRFS 
> typically used much more CPU.
>
> Has BTRFS performance gotten any better since you wrote the 
> performance articles?

So the trick with those articles is that the systems are fresh, and most of the initial articles were using rados bench which is always writing out new objects vs something like RBD where you are (usually) doing writes to existing objects that represent the blocks.  If you were to do a bunch of random 4k writes and then later try to do sequential reads, you'd see BTRFS sequential read performance tank.  We actually did tests like that with emperor during the firefly development cycle.  I've included the results. Basically the first iteration of the test cycle looks great on BTRFS, then you see read performance drop way down. Eventually write performance is also likely to drop as the disks become extremely fragmented (we may even see a little of that in those tests).

>
> Have you compared ZFS (ZoL) performance to BTRFS?

I did way back in 2013 when we were working with Brian Behlendorf to fix xattr bugs in ZOL.  It was quite a bit slower if you didn't enable SA xattrs.  With SA xattrs, it was much closer, but not as fast as btrfs or xfs.  I didn't do a lot of tuning though and Ceph wasn't making good use of ZFS features, so it's very possible things have changed.

>
>