Re: Tuning ZFS + QEMU/KVM + Ceph RBDs

A bunch of ideas. I'm not an expert in either Ceph or ZFS, so take these with appropriately sized boulders of salt.

You might want to ask this on the ZFS list as well; especially if your Ceph cluster is serving non-ZFS workloads, you're more likely to want to tune ZFS for Ceph than the other way around. (Apologies if you already have and I missed it.)

Are your ZFS vdevs' 128 KB records aligned with Ceph's 4 MB objects? If you're not seeing massive write amplification (it doesn't sound like you are, but who knows), this is unlikely to be the problem, but if it is, it's a nice easy fix: give ZFS an appropriately offset partition, as sketched below. [0]
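
Something like this, if alignment does turn out to be the issue (untested; the device name and the ashift are assumptions, adjust to taste):

    # start the partition on a 4 MiB boundary (8192 * 512 B sectors) so zfs
    # records don't straddle ceph object boundaries
    sgdisk --set-alignment=8192 --new=1:0:0 /dev/rbd0
    partprobe /dev/rbd0
    zpool create -o ashift=12 tank /dev/rbd0p1
    zfs set recordsize=128K tank    # 128K is the default, just being explicit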

I assume you're mapping a single RBD device to a single ZFS vdev; if so, you might benefit from a higher queue depth. ZFS assumes each vdev is something like a physical disk rather than a magical distributed block device, so setting zfs_vdev_max_pending [1] might help; an RBD ought to be fairly similar to a big iSCSI LUN from a tuning point of view.
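
Roughly, on a Linux client whose ZoL still has that tunable (newer releases split it into the zfs_vdev_*_max_active knobs instead, so treat the exact name as an assumption for your version):

    # default is 10 outstanding I/Os per vdev, sized for a single spindle;
    # an rbd-backed vdev can usually take a lot more
    echo 64 > /sys/module/zfs/parameters/zfs_vdev_max_pending
    # persist it across reboots
    echo "options zfs zfs_vdev_max_pending=64" >> /etc/modprobe.d/zfs.conf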

ZFS really needs write barriers to be respected, which Ceph RBD caching purports to do. It certainly seems like it should be safe, but of course who really knows. I'd want to spend some time killing VMs and partitioning the RBD client away from the Ceph cluster, and making sure the pool comes back up happy, before I rolled it out, but it's not a crazy idea IMO.
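
For reference, these are the rbd cache settings I'd double-check on the hypervisor (plus cache=writeback on the qemu drive so guest flushes actually reach librbd):

    [client]
    rbd cache = true
    # stay in writethrough until the guest issues its first flush, so a guest
    # that never sends barriers doesn't silently run with an unsafe cache
    rbd cache writethrough until flush = true

For the kill-testing, hard-resetting the VM or cutting it off from the cluster and then doing a zpool import plus a scrub (zpool scrub tank; zpool status -v tank) on the recovered pool is a reasonably convincing smoke test.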

Setting the Ceph object size to 128 KB, to match the ZFS recordsize, would also be an interesting experiment.
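
Object size is fixed per image, so that means creating new images, e.g. (pool/image names made up; --order 17 means 2^17 = 128K objects, versus the default order 22 = 4M):

    rbd create --size 102400 --order 17 rbd/zfs-test    # 100G image, 128K objects

The trade-off is ~32x more objects per image, which the OSDs will feel, so I'd benchmark before committing to it.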

Remember that Ceph and ZFS are each trading some amount of performance to provide integrity guarantees in the presence of untrustworthy hardware. Paying that cost twice is going to hurt.

I'd think about exactly where you want your redundancy to lie and let the other components assume it already exists. If you have non-ZFS RBD workloads, maybe you want to set nocacheflush, or even disable the ZIL, and trust that Ceph is more reliable than the hardware ZFS assumes it's running on.
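
Roughly (dataset name made up; either way you're accepting that the last few seconds of sync writes can vanish if the client host dies):

    # per-dataset: don't block on the ZIL for synchronous writes at all
    zfs set sync=disabled tank/images
    # or keep the ZIL but stop sending cache-flush commands to the vdevs
    echo "options zfs zfs_nocacheflush=1" >> /etc/modprobe.d/zfs.conf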

Alternatively, you could go the other way: build a Ceph pool with reduced redundancy and pass multiple RBDs into a raidz or something.
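
A sketch of that (pool/image names and device paths made up; also note that nothing stops two of these images from sharing OSDs, so the failure domains overlap in ways raidz can't see):

    ceph osd pool create zraid 128 128 replicated
    ceph osd pool set zraid size 1     # one copy per object; zfs supplies the redundancy
    for i in 0 1 2 3; do
        rbd create --size 102400 zraid/vdev$i
        rbd map zraid/vdev$i
    done
    zpool create tank raidz /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3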

Finally, if your workloads don't migrate much, you might be able to set up a non-Ceph SLOG device (or two) to eat the ZIL-induced latency.
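
Something like (device paths made up; only sync-write latency moves off Ceph, async writes still land on the rbd vdevs):

    # mirror the slog so a single dead SSD doesn't cost you in-flight sync writes
    zpool add tank log mirror /dev/disk/by-id/nvme-SSD0-part1 /dev/disk/by-id/nvme-SSD1-part1
    zpool status tank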

On Mon, Dec 28, 2015 at 6:59 PM, J David <j.david.lists@xxxxxxxxx> wrote:
Yes, given the architectural design limitations of ZFS, there will
indeed always be performance consequences for using it in an
environment its creators never envisioned, like Ceph.  But ZFS offers
many advanced features not found on other filesystems, and for
production environments that depend on those features, it’s very
reasonable to still want them in an environment that happens to be
backed by Ceph.

Keep in mind also that the FreeBSD and Solaris installers both create ZFS
filesystems (on Solaris it’s the default and only option; on FreeBSD I’m not
sure, but it may be the default in the most recent release), so this is not
just a question about ZFS on Linux.  ZFS is a *very* popular
filesystem in wide usage and is the *only* cross-platform filesystem
to offer the features it does.

So, until there’s another broadly-supported, ceph-aware,
production-quality filesystem that offers feature parity with it, the
question of how to get the best (or, if you prefer, least worst)
ZFS-on-ceph performance is worth asking.

In light of that, is it possible to do any better than just writing it
off as a lost cause?  This is work we’re absolutely willing to do, we
just don’t feel we have a good understanding of all the moving parts
involved, and how to measure and tune them all.  (And, most
importantly, how to measure the impact of the tuning.)

Thanks!

On Fri, Dec 25, 2015 at 9:06 PM, Tyler Bishop
<tyler.bishop@xxxxxxxxxxxxxxxxx> wrote:
> Due to the nature of distributed storage and a filesystem built to distribute itself across sequential devices, you're always going to have poor performance.
>
> Are you unable to use XFS inside the vm?
>



--
Patrick Hahn
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
