Hi Matthew!
On 6/29/23 06:23, Matthew Booth wrote:
On Wed, 28 Jun 2023 at 22:44, Ilya Dryomov <idryomov@xxxxxxxxxx> wrote:
** TL;DR
In testing, the write latency of a PWL-cache-backed RBD disk was
2 orders of magnitude worse than that of the disk holding the PWL
cache.
** Summary
I was hoping that the PWL cache might be a good solution to the
write latency requirements of etcd when running a Kubernetes control
plane on Ceph. Etcd is extremely write-latency sensitive and becomes
unstable if write latency is too high. The etcd workload can be
characterised by very small (~4k) writes with a queue depth of 1.
Throughput, even on a busy system, is normally very low. As etcd is
distributed and can safely handle the loss of un-flushed data from a
single node, a local SSD PWL cache for etcd looked like an ideal
solution.
Right, this is exactly the use case that the PWL cache is supposed to address.
Good to know!
My expectation was that adding a PWL cache on a local SSD to an
RBD-backed disk would improve write latency to something approaching
that of the local SSD. However, in my testing, adding a PWL cache to
an RBD-backed VM increased write latency by approximately 4x over not
using a PWL cache, leaving it over 100x higher than the write latency
of the underlying SSD.
My expectation was based on the documentation here:
https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
“The cache provides two different persistence modes. In
persistent-on-write mode, the writes are completed only when they are
persisted to the cache device and will be readable after a crash. In
persistent-on-flush mode, the writes are completed as soon as it no
longer needs the caller’s data buffer to complete the writes, but does
not guarantee that writes will be readable after a crash. The data is
persisted to the cache device when a flush request is received.”
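(For reference, the PWL cache was enabled with pool-level settings
along these lines; the values match the configuration dump further
down, though the exact commands are paraphrased:)

rbd config pool set libvirt-pool rbd_plugins pwl_cache
rbd config pool set libvirt-pool rbd_persistent_cache_mode ssd
rbd config pool set libvirt-pool rbd_persistent_cache_path /var/lib/libvirt/images/pwl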
** Method
2 systems: 1 running single-node Ceph Quincy (17.2.6), the other
running libvirt and attaching a VM's disk with librbd (also 17.2.6)
from the first node.
All performance testing is from the libvirt system. I tested write
latency performance:
* Inside the VM without a PWL cache
* Of the PWL device directly from the host (direct to filesystem, no VM)
* Inside the VM with a PWL cache
I am testing with fio. Specifically, I am running a containerised
test, executed with:
podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
This container runs:
fio --rw=write --ioengine=sync --fdatasync=1
--directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf
--output-format=json --runtime=60 --time_based=1
And extracts sync.lat_ns.percentile["99.000000"]
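(Outside the container, the same number can be pulled from fio's JSON
output with jq; this is a sketch of the equivalent invocation, not
necessarily exactly what the container runs:)

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd \
    --size=100m --bs=8000 --name=etcd_perf --output-format=json \
    --runtime=60 --time_based=1 \
  | jq '.jobs[0].sync.lat_ns.percentile["99.000000"]'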
Matthew, do you have the rest of the fio output captured? It would be interesting to see whether it's just the 99th percentile that is bad or whether the PWL cache is worse in general.
Sure.
With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, 'rbd_cache'=false:
https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/
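(For that run the volatile cache was disabled with an image-level
override, something like the following; I may have set it at pool
scope instead:)

rbd config image set libvirt-pool/pwl-test rbd_cache false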
Also, how's the CPU usage client side? I would be very curious to see
if unwindpmp shows anything useful (especially lock contention):
https://github.com/markhpc/uwpmp
Just attach it to the client-side process and start out with something
like 100 samples (more are better but take longer). You can run it like:
./unwindpmp -n 100 -p <pid>
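For a libvirt guest the client-side process is the VM's qemu process,
so something like this (the qemu binary name varies by distro):

./unwindpmp -n 100 -p $(pgrep -f qemu-kvm | head -n1)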
Mark
** Results
All results are fio's 99th percentile fdatasync latency and were stable across multiple runs within a small margin of error.
* rbd no cache: 1417216 ns
* pwl cache device: 44288 ns
* rbd with pwl cache: 5210112 ns
Note that adding a PWL cache increased write latency by approximately
4x over no cache (5210112 / 1417216 ≈ 3.7x), leaving it more than
100x that of the underlying device (5210112 / 44288 ≈ 118x).
** Hardware
2 x Dell R640s, each with Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM
Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to PERC
H730P Mini (Embedded)
OS installed on rotational disks
N.B. Linux incorrectly detects these disks as rotational, which I
assume relates to weird behaviour by the PERC controller. I remembered
to manually correct this on the ‘client’ machine for the PWL cache,
but at OSD configuration time Ceph would have detected them as
rotational. They are not rotational.
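(For reference, the flag can be checked and overridden via sysfs; sdX
is a placeholder for the SSD's device name:)

cat /sys/block/sdX/queue/rotational    # 1 = detected as rotational
echo 0 > /sys/block/sdX/queue/rotational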
** Ceph Configuration
CentOS Stream 9
# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Single node installation with cephadm. 2 OSDs, one on each SSD.
1 pool with size 2
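(Roughly, the deployment looked like this; a sketch from memory, with
<mon-ip>, <host> and /dev/sdX as placeholders:)

cephadm bootstrap --mon-ip <mon-ip> --single-host-defaults
ceph orch daemon add osd <host>:/dev/sdX    # once per SSD
ceph osd pool create libvirt-pool
ceph osd pool set libvirt-pool size 2
rbd pool init libvirt-pool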
** Client Configuration
Fedora 38
librbd1-17.2.6-3.fc38.x86_64
The PWL cache is an XFS filesystem with a 4k block size, matching the
underlying device. The filesystem uses the whole block device. There
is no other load on the system.
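(The cache filesystem was created along these lines; /dev/sdX is
again a placeholder for the SSD:)

mkfs.xfs -b size=4096 /dev/sdX
mount /dev/sdX /var/lib/libvirt/images/pwl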
** RBD Configuration
# rbd config image list libvirt-pool/pwl-test | grep cache
rbd_cache                            true                         config
I wonder if rbd_cache should have been set to false here to disable the default volatile cache. Other than that, I don't see anything obviously wrong with the configuration at first sight.
I added some full output for this above.
--
Ilya
rbd_cache_block_writes_upfront       false                        config
rbd_cache_max_dirty                  25165824                     config
rbd_cache_max_dirty_age              1.000000                     config
rbd_cache_max_dirty_object          0                            config
rbd_cache_policy                     writeback                    pool
rbd_cache_size                       33554432                     config
rbd_cache_target_dirty               16777216                     config
rbd_cache_writethrough_until_flush   true                         pool
rbd_parent_cache_enabled             false                        config
rbd_persistent_cache_mode            ssd                          pool
rbd_persistent_cache_path            /var/lib/libvirt/images/pwl  pool
rbd_persistent_cache_size            1073741824                   config
rbd_plugins                          pwl_cache                    pool
# rbd status libvirt-pool/pwl-test
Watchers:
watcher=10.1.240.27:0/1406459716 client.14475
cookie=140282423200720
Persistent cache state:
host: dell-r640-050
path: /var/lib/libvirt/images/pwl/rbd-pwl.libvirt-pool.37e947fd216b.pool
size: 1 GiB
mode: ssd
stats_timestamp: Mon Jun 26 11:29:21 2023
present: true empty: false clean: true
allocated: 180 MiB
cached: 135 MiB
dirty: 0 B
free: 844 MiB
hits_full: 1 / 0%
hits_partial: 3 / 0%
misses: 21952
hit_bytes: 6 KiB / 0%
miss_bytes: 349 MiB
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx