Re: Tuning ZFS + QEMU/KVM + Ceph RBD’s

A bunch of ideas. I'm not an expert in either ceph or zfs so take with appropriately sized boulders of salt.

You might want to ask this on the ZFS list as well. Especially if your Ceph cluster is also serving non-ZFS workloads, you're more likely to want to tune ZFS for Ceph rather than the other way around. (Apologies if you have and I missed it.)

Are your ZFS 128 KB records aligned with Ceph's 4 MB objects? If you're not seeing massive write amplification (it doesn't sound like you are, but who knows) misalignment is unlikely to be the problem, but if it is, it's a nice easy fix: give ZFS an appropriately offset partition. [0]
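To make the alignment question concrete, here is a rough Python sketch that counts how many fixed-size records would straddle a Ceph object boundary for a given partition offset. It assumes records are laid out back to back from the start of the partition, which is a simplification (ZFS has labels and variable block sizes), and the function name and defaults are mine, not anything from Ceph or ZFS.

```python
def straddling_records(partition_offset, record_size=128 * 1024,
                       object_size=4 * 1024 * 1024, records=1024):
    """Count records that cross a Ceph object boundary.

    partition_offset is the byte offset of the partition from the start
    of the RBD image. Records are assumed to be packed back to back,
    which real ZFS does not guarantee.
    """
    crossing = 0
    for i in range(records):
        start = partition_offset + i * record_size
        end = start + record_size - 1  # last byte of this record
        if start // object_size != end // object_size:
            crossing += 1
    return crossing


if __name__ == "__main__":
    # Old-style 63-sector offset versus a 1 MiB offset.
    print(straddling_records(63 * 512))
    print(straddling_records(1024 * 1024))
```

Under these assumptions an offset that is a multiple of the record size never splits a record across two objects, while an odd offset splits one record per object.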

I assume you're mapping a single RBD device to a single ZFS vdev, in which case you might benefit from a deeper queue. ZFS assumes that each vdev is something like a physical disk rather than a magical distributed block device, so raising zfs_vdev_max_pending [1] might help. RBD ought to be fairly similar to a big iSCSI LUN from a tuning point of view.
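The reasoning behind a deeper queue is just Little's law: with a fixed number of requests in flight, per-request latency caps the achievable IOPS. A back-of-the-envelope sketch in Python follows; the latency and queue depth figures are made up for illustration, not measurements from any cluster.

```python
def iops_ceiling(queue_depth, latency_ms):
    """Upper bound on IOPS from Little's law.

    concurrency = throughput * latency, so
    throughput = concurrency / latency.
    """
    if queue_depth < 1 or latency_ms <= 0:
        raise ValueError("queue_depth must be >= 1 and latency_ms > 0")
    return queue_depth * 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical 2 ms round trip to the cluster per request.
    for depth in (1, 10, 32):
        print(depth, iops_ceiling(depth, 2.0))
```

A local disk with sub-millisecond latency can do fine with a shallow queue; a network block device with higher latency needs more requests in flight to reach the same throughput.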

ZFS really needs write barriers to be respected, which Ceph RBD caching purports to do. It certainly seems like it should be safe, but of course who really knows. Before I thought about rolling it out I'd want to spend some time killing VMs and partitioning the RBD client from the Ceph cluster, then checking that the pool comes up happy each time. It's not a crazy idea IMO, though.

Setting the ceph object size to 128k would be interesting.
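For reference, RBD object sizes are powers of two, and the rbd tool has historically expressed them as an "order" (log2 of the object size in bytes). A small Python sketch of the arithmetic, with helper names of my own invention, also shows how many more objects a 128k object size implies for the same image:

```python
def rbd_order(object_size):
    """Return log2(object_size); object_size must be a power of two."""
    if object_size <= 0 or object_size & (object_size - 1):
        raise ValueError("object size must be a power of two")
    return object_size.bit_length() - 1


def object_count(image_size, object_size):
    """Number of objects needed to back an image (ceiling division)."""
    return -(-image_size // object_size)


if __name__ == "__main__":
    image = 100 * 1024 ** 3  # hypothetical 100 GiB image
    for size in (4 * 1024 ** 2, 128 * 1024):
        print(size, rbd_order(size), object_count(image, size))
```

Going from 4M to 128k objects multiplies the object count by 32, so whatever you gain in alignment you pay for in per-object overhead on the Ceph side.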

Remember that Ceph and ZFS are both trading some amount of performance for safety guarantees in the presence of untrustworthy hardware. Doing that twice is going to be bad.

I'd think about exactly where you want your redundancy to lie, and allow the other components to assume that it already exists. If you have non-ZFS RBD workloads, maybe you want to set nocacheflush or even disable the ZIL, and trust that Ceph is more reliable than the hardware ZFS assumes it's using.

Alternatively, you could go the other way: build a Ceph pool with reduced redundancy and pass multiple RBDs into a raidz or something.
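To compare where the redundancy could live, here is a rough Python calculation of usable capacity as a fraction of raw disk when Ceph replication and raidz parity are stacked. It treats raidz usable space as simply (devices - parity) / devices, which ignores padding and metadata, and the example layouts are hypothetical.

```python
def usable_fraction(replicas, raidz_devices=1, raidz_parity=0):
    """Usable space / raw space with raidz on top of a replicated pool.

    replicas is the Ceph pool's replica count. raidz_devices and
    raidz_parity describe the ZFS vdev built from RBD images; the
    defaults model a plain single-device pool with no ZFS parity.
    """
    if replicas < 1 or raidz_parity < 0 or raidz_devices <= raidz_parity:
        raise ValueError("invalid redundancy layout")
    return (raidz_devices - raidz_parity) / (raidz_devices * replicas)


if __name__ == "__main__":
    print(usable_fraction(3))        # 3x replication, single RBD
    print(usable_fraction(2, 5, 1))  # 2x replication under a 5-wide raidz1
    print(usable_fraction(1, 5, 1))  # unreplicated pool, raidz1 only
```

This only covers capacity; the write amplification from doing the redundancy work twice is a separate cost that you'd have to measure.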

Finally, if your workloads don't migrate much, you might be able to set up a non-Ceph SLOG device (or two) to eat the ZIL-induced latency.

On Mon, Dec 28, 2015 at 6:59 PM, J David <j.david.lists@xxxxxxxxx> wrote:
Yes, given the architectural design limitations of ZFS, there will
indeed always be performance consequences for using it in an
environment its creators never envisioned, like Ceph.  But ZFS offers
many advanced features not found on other filesystems, and for
production environments that depend on those features, it’s very
reasonable to still want them in an environment that happens to be
backed by Ceph.

Keep in mind also that FreeBSD and Solaris installers both create ZFS
filesystems (Solaris by default/only option, FreeBSD I’m not sure
about, it may be default in the most recent release), so this is not
just a question about ZFS on Linux.  ZFS is a *very* popular
filesystem in wide usage and is the *only* cross-platform filesystem
to offer the features it does.

So, until there’s another broadly-supported, ceph-aware,
production-quality filesystem that offers feature parity with it, the
question of how to get the best (or, if you prefer, least worst)
ZFS-on-ceph performance is worth asking.

In light of that, is it possible to do any better than just writing it
off as a lost cause?  This is work we’re absolutely willing to do, we
just don’t feel we have a good understanding of all the moving parts
involved, and how to measure and tune them all.  (And, most
importantly, how to measure the impact of the tuning.)

Thanks!

On Fri, Dec 25, 2015 at 9:06 PM, Tyler Bishop
<tyler.bishop@xxxxxxxxxxxxxxxxx> wrote:
> Due to the nature of distributed storage and a filesystem built to distribute itself across sequential devices, you're always going to have poor performance.
>
> Are you unable to use XFS inside the vm?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Patrick Hahn