For a variety of reasons, a ZFS pool in a QEMU/KVM virtual machine backed by a Ceph RBD doesn’t perform very well. Does anyone have any tuning tips (on either side) for this workload?

A fair amount of the problem is probably related to two factors.

First, ZFS always assumes it is talking to bare-metal drives. This assumption is baked into it at a very fundamental level, and Ceph is pretty much the polar opposite of that. For one thing, this makes any type of write caching moderately terrifying from a data-loss standpoint. To be fair, we run some non-critical KVM VMs with ZFS filesystems and cache=writeback with no observed ill effects. From the available information, it *seems* safe to do that, but it’s not certain whether enough stress and the wrong crash at the wrong moment would result in a lost or corrupted pool. ZFS is notorious for exploding if the underlying subsystem lies to it about whether data has been permanently written to disk (that bare-metal assumption again); it’s not an area that encourages pressing one’s luck.

The second issue is that ZFS likes a huge recordsize. It uses small blocks for small files, but as soon as a file grows a little bit, it is happy to use 128KiB blocks (again assuming it’s talking to a physical disk that can do a sequential read of a whole block with minimal added overhead, because the head was already there for the first byte, and what’s a little wasted bandwidth on a 6Gbps SAS bus that has nothing else to do). Ceph, on the other hand, *always* has something else to do, so a 128K read-modify-write cycle to change one byte in the middle of a file winds up being punishingly wasteful.

The RBD striping explanation (at http://docs.ceph.com/docs/hammer/man/8/rbd/) seems to suggest that the default object size is 4M, so at least a single 128K read/write should only hit one or (at most) two objects. Whether it’s one or two seems to depend on whether ZFS has a useful interpretation of track size, which it may not.
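To put rough numbers on the one-or-two-objects question, here is a minimal Python sketch. It assumes the default RBD layout (4M objects, no fancy striping), so that a byte offset on the image maps to object number offset // object_size. It also assumes, purely for illustration, that ZFS lays 128K records down back to back starting at some fixed offset. Real ZFS allocation is less tidy than that, so treat the result as an order-of-magnitude estimate. The function names are made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB


def objects_touched(offset, length, object_size=4 * MIB):
    """Count the RBD objects overlapped by the byte range [offset, offset + length).

    Assumes simple striping: object index = byte offset // object_size.
    """
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return last - first + 1


def straddle_fraction(start_offset, record_size=128 * KIB,
                      object_size=4 * MIB, records=4096):
    """Fraction of back-to-back records that cross an object boundary.

    Assumption (illustrative only): records sit at
    start_offset + i * record_size for i = 0 .. records - 1.
    """
    straddling = sum(
        1
        for i in range(records)
        if objects_touched(start_offset + i * record_size,
                           record_size, object_size) > 1
    )
    return straddling / records


if __name__ == "__main__":
    # Aligned start versus a start shifted by one 62-sector "track".
    for start in (0, 62 * 512):
        print(f"start offset {start:6d} bytes: "
              f"{straddle_fraction(start):.4f} of records straddle")
```

Under those assumptions an aligned start never straddles, and a start shifted by one 62-sector track straddles on 1 record in 32, so the common case is still a single object per 128K record.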
One such virtual machine reports, for a 1TB Ceph image, 62 sectors of 512 bytes per track, or 31K per track. That could lead to a fair number of object-straddling reads and writes at a 128K record size.

So the main impact of that is massive write amplification; writing one byte can turn into reading and writing 128K from/to 2-6 different OSDs. All of which winds up passing over the storage LAN, introducing tons of latency compared to that hypothetical 6Gbps SAS read that ZFS is designed to expect.

If it helps establish a baseline, the reason this subject comes up is that currently ZFS filesystems on RBD-backed QEMU VMs do stuff like this (iostat -x at 10-second intervals):

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
vdc        0.00    0.00   41.00   31.30   86.35   4006.40    113.22      2.07  27.18    24.10    31.22  13.82  99.92
vdc        0.00    0.00  146.30   38.10  414.95   4876.80     57.39      2.46  13.64    10.36    26.25   5.42  99.96
vdc        0.00    0.00  127.30  102.20  256.40  13081.60    116.24      2.07   9.19     8.57     9.97   4.35  99.88
vdc        0.00    0.00  160.80  160.70  297.30  10592.80     67.75      1.21   3.76     1.73     5.78   2.91  93.68

That’s… not great… for a low-load 10G LAN Ceph cluster with 60 Intel DC S37X0 SSDs.

Is there some tuning that could be done (on any side: ZFS, QEMU, or Ceph) to optimize performance? Are there any metrics we could collect to gain more insight into what and where the bottlenecks are?

Some combination of changing the ZFS max recordsize, the QEMU virtual disk geometry, and Ceph backend settings seems like it might make a big difference, but there are many combinations, and it feels like guesswork with the available information. So it seems worthwhile to ask whether anyone has been down this road, and if so what they found, before spending a week or two rediscovering the wheel.

Thanks for any advice!
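P.S. As one cheap metric to start from, the iostat samples above can be reduced to an average size per request by dividing throughput by request rate. The Python sketch below does only that arithmetic on the four rows already quoted; the helper names (parse, avg_sizes) are invented for this example, and it is not a substitute for measuring on the Ceph side.

```python
# Column names as printed by the iostat -x output quoted in this post.
COLUMNS = ("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz "
           "await r_await w_await svctm %util").split()

# The four 10-second samples from the post, copied verbatim.
SAMPLES = """\
vdc 0.00 0.00 41.00 31.30 86.35 4006.40 113.22 2.07 27.18 24.10 31.22 13.82 99.92
vdc 0.00 0.00 146.30 38.10 414.95 4876.80 57.39 2.46 13.64 10.36 26.25 5.42 99.96
vdc 0.00 0.00 127.30 102.20 256.40 13081.60 116.24 2.07 9.19 8.57 9.97 4.35 99.88
vdc 0.00 0.00 160.80 160.70 297.30 10592.80 67.75 1.21 3.76 1.73 5.78 2.91 93.68
"""


def parse(line):
    """Split one iostat row into (device, {column: value})."""
    device, *values = line.split()
    return device, dict(zip(COLUMNS, map(float, values)))


def avg_sizes(row):
    """Return (average kB per read, average kB per write) for one sample."""
    read_kb = row["rkB/s"] / row["r/s"] if row["r/s"] else 0.0
    write_kb = row["wkB/s"] / row["w/s"] if row["w/s"] else 0.0
    return read_kb, write_kb


ROWS = [parse(line)[1] for line in SAMPLES.splitlines() if line.strip()]

if __name__ == "__main__":
    for row in ROWS:
        read_kb, write_kb = avg_sizes(row)
        print(f"avg read {read_kb:6.1f} kB  avg write {write_kb:6.1f} kB  "
              f"w_await {row['w_await']:5.2f} ms")
```

On these samples the first three intervals come out at 128 kB per write, the same figure as the 128K record size discussed above, while the reads average only about 2 to 3 kB each.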