Due to the nature of distributed storage, and a filesystem built to spread itself across sequential devices... you're always going to have poor performance. Are you unable to use XFS inside the VM?

----- Original Message -----
From: "J David" <j.david.lists@xxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, December 24, 2015 1:10:36 PM
Subject: Tuning ZFS + QEMU/KVM + Ceph RBDs

For a variety of reasons, a ZFS pool in a QEMU/KVM virtual machine backed by a Ceph RBD doesn't perform very well. Does anyone have any tuning tips (on either side) for this workload?

A fair amount of the problem is probably related to two factors.

First, ZFS always assumes it is talking to bare-metal drives. That assumption is baked in at a very fundamental level, and Ceph is pretty much the polar opposite of that. For one thing, it makes any type of write caching moderately terrifying from a data-loss standpoint. Although, to be fair, we run some non-critical KVM VMs with ZFS filesystems and cache=writeback with no observed ill effects. From the available information it *seems* safe to do that, but it's not certain that, under enough stress and with the wrong crash at the wrong moment, the result wouldn't be a lost or corrupted pool. ZFS is notorious for exploding if the underlying storage lies to it about whether data has been permanently written to disk (that bare-metal assumption again); it's not an area that encourages pressing one's luck.

The second issue is that ZFS likes a huge recordsize. It uses small blocks for small files, but as soon as a file grows a little, it is happy to use 128KiB blocks (again assuming it's talking to a physical disk that can do a sequential read of a whole block with minimal added overhead, since the head was already there for the first byte, and what's a little wasted bandwidth on a 6Gbps SAS bus that has nothing else to do). Ceph, on the other hand, *always* has something else to do, so a 128K read-modify-write cycle to change one byte in the middle of a file winds up being punishingly wasteful.

The RBD striping explanation (on http://docs.ceph.com/docs/hammer/man/8/rbd/) seems to suggest that the default object size is 4M, so at least a single 128K read/write should only hit one or (at most) two objects. Whether it's one or two seems to depend on whether ZFS has a useful interpretation of the track size, which it may not. One such virtual machine reports, for a 1TB Ceph image, 62 sectors of 512 bytes per track, i.e. 31K per track. That could lead to a fair number of object-straddling reads and writes at a 128K record size.

So the main impact is massive write amplification: writing one byte can turn into reading and writing 128K from/to 2-6 different OSDs, all of which passes over the storage LAN, introducing tons of latency compared to that hypothetical 6Gbps SAS read that ZFS is designed to expect.
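To make that concrete, here is roughly what the knobs look like on the RBD side; the pool and image names are placeholders and nothing here has been tested, so treat it as a sketch rather than a recommendation:

    # Show the current object size / striping for an image
    # (the default is order 22, i.e. 4 MiB objects)
    rbd info rbd/zfs-vm-disk

    # An image with smaller objects, e.g. order 17 = 128 KiB, would line one
    # ZFS record up with roughly one RADOS object (size below is in MB;
    # newer rbd releases spell this option --object-size)
    rbd create rbd/zfs-vm-disk-test --size 102400 --order 17

Whether smaller objects would actually help, or just trade write amplification for per-object overhead on the OSDs, is part of what's unclear.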
If it helps establish a baseline, the reason this subject comes up is that currently ZFS filesystems on RBD-backed QEMU VMs do stuff like this (iostat -x at 10-second intervals):

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
vdc        0.00    0.00  41.00  31.30   86.35  4006.40   113.22     2.07  27.18   24.10   31.22  13.82  99.92
vdc        0.00    0.00 146.30  38.10  414.95  4876.80    57.39     2.46  13.64   10.36   26.25   5.42  99.96
vdc        0.00    0.00 127.30 102.20  256.40 13081.60   116.24     2.07   9.19    8.57    9.97   4.35  99.88
vdc        0.00    0.00 160.80 160.70  297.30 10592.80    67.75     1.21   3.76    1.73    5.78   2.91  93.68

That's... not great... for a low-load 10G LAN Ceph cluster with 60 Intel DC S37X0 SSDs.

Is there some tuning that could be done (on any side: ZFS, QEMU, or Ceph) to optimize performance? Are there any metrics we could collect to gain more insight into what and where the bottlenecks are?

Some combination of changing the ZFS max recordsize, the QEMU virtual disk geometry, and the Ceph backend settings seems like it might make a big difference, but there are many combinations, and with the available information it feels like guesswork. So it seems worthwhile to ask whether anyone has been down this road and, if so, what they found, before spending a week or two rediscovering the wheel.

Thanks for any advice!
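P.S. In case a concrete starting point makes it easier to point out mistakes, the combination currently under consideration looks roughly like the following. Every name and value is a placeholder or a guess, not something that has been validated here.

Guest side, capping the ZFS recordsize on the datasets that take small random writes:

    zfs set recordsize=16K tank/data

Host side, the libvirt stanza for the RBD-backed virtio disk; the cache mode and the geometry override are the QEMU-level knobs mentioned above (cephx auth omitted for brevity, and it's an open question whether ZFS pays attention to the reported geometry at all):

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='rbd/zfs-vm-disk'>
        <host name='mon1' port='6789'/>
      </source>
      <geometry cyls='16383' heads='16' secs='63'/>
      <target dev='vdc' bus='virtio'/>
    </disk>

Ceph client side, in ceph.conf on the hypervisor, the librbd cache settings that pair with cache=writeback (the writethrough-until-flush option keeps the cache in writethrough mode until the guest issues its first flush):

    [client]
        rbd cache = true
        rbd cache writethrough until flush = true

If anyone has measured which of these actually moves numbers like the iostat output above, that would save a lot of guesswork.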