On Tue, 27 Jun 2023 at 18:20, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
>
> Hi Matthew,
>
> We've done a limited amount of work on characterizing the pwl and I
> think it suffers from the classic problem of some writeback caches:
> once the cache is saturated, it's actually worse than just being in
> writethrough. IIRC the pwl does try to preserve write ordering (unlike
> the other writeback/writearound modes), which limits the concurrency
> it can issue to the backend. That means even an iodepth=1 test can
> saturate the pwl, assuming the backend latency is higher than the pwl
> latency.

What do you mean by saturated here? FWIW I was using the default cache
size of 1G and each test run only wrote ~100MB of data, so I don't
think I ever filled the cache, even with multiple runs.

> I _think_ that if you were able to devise a burst test with bursts
> smaller than the pwl capacity and gaps in between large enough for the
> cache to flush, or if you were to ratelimit I/Os to the pwl, you
> should see something closer to the lower latencies that you would
> expect.

My goal is to characterise the requirements of etcd, and unfortunately
I don't think changing the test would do that.

Incidentally, note that the total bandwidth of an extremely busy etcd
is usually very low. From memory, the etcd write rate for a system we
were debugging, whose etcd was occasionally falling over due to load,
was only about 5MiB/s. It's all about the write latency of really
small writes, not bandwidth.

Matt

>
> Josh
>
> On Tue, Jun 27, 2023 at 9:04 AM Matthew Booth <mbooth@xxxxxxxxxx> wrote:
>>
>> ** TL;DR
>>
>> In testing, the write latency performance of a PWL-cache-backed RBD
>> disk was 2 orders of magnitude worse than that of the disk holding
>> the PWL cache.
>>
>> ** Summary
>>
>> I was hoping that the PWL cache might be a good solution to the
>> problem of etcd's write latency requirements when running a
>> kubernetes control plane on ceph. Etcd is extremely write latency
>> sensitive and becomes unstable if write latency is too high. The etcd
>> workload can be characterised by very small (~4k) writes with a queue
>> depth of 1. Throughput, even on a busy system, is normally very low.
>> As etcd is distributed and can safely handle the loss of un-flushed
>> data from a single node, a local SSD PWL cache for etcd looked like
>> an ideal solution.
>>
>> My expectation was that adding a PWL cache on a local SSD to an
>> RBD-backed VM would improve write latency to something approaching
>> the write latency performance of the local SSD. However, in my
>> testing, adding a PWL cache to an rbd-backed VM increased write
>> latency by approximately 4x over not using a PWL cache. This was over
>> 100x worse than the write latency performance of the underlying SSD.
>>
>> My expectation was based on the documentation here:
>> https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
>>
>> “The cache provides two different persistence modes. In
>> persistent-on-write mode, the writes are completed only when they are
>> persisted to the cache device and will be readable after a crash. In
>> persistent-on-flush mode, the writes are completed as soon as it no
>> longer needs the caller’s data buffer to complete the writes, but does
>> not guarantee that writes will be readable after a crash. The data is
>> persisted to the cache device when a flush request is received.”
>>
>> ** Method
>>
>> 2 systems, 1 running single-node Ceph Quincy (17.2.6), the other
>> running libvirt and mounting a VM’s disk with librbd (also 17.2.6)
>> from the first node.
>>
>> All performance testing is from the libvirt system. I tested write
>> latency performance:
>>
>> * Inside the VM without a PWL cache
>> * Of the PWL device directly from the host (direct to filesystem, no VM)
>> * Inside the VM with a PWL cache
>>
>> I am testing with fio. Specifically, I am running a containerised
>> test, executed with:
>>
>> podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
>>
>> This container runs:
>>
>> fio --rw=write --ioengine=sync --fdatasync=1 \
>>   --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf \
>>   --output-format=json --runtime=60 --time_based=1
>>
>> and extracts sync.lat_ns.percentile["99.000000"].
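>>
>> For anyone reproducing this without the container: the same figure can
>> be read straight out of fio's JSON output. A minimal sketch, assuming
>> fio was run with --output=etcd_perf.json (the filename and the use of
>> jq here are illustrative, not part of the container):
>>
>> # 99th percentile fdatasync latency in ns (assumes a single fio job)
>> jq '.jobs[0].sync.lat_ns.percentile["99.000000"]' etcd_perf.json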
>>
>> ** Results
>>
>> All results were stable across multiple runs within a small margin of
>> error.
>>
>> * rbd no cache:       1417216 ns
>> * pwl cache device:     44288 ns
>> * rbd with pwl cache: 5210112 ns
>>
>> Note that by adding a PWL cache we increase write latency by
>> approximately 4x, which is more than 100x that of the underlying
>> device.
>>
>> ** Hardware
>>
>> 2 x Dell R640s, each with a Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM
>> Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to a
>> PERC H730P Mini (Embedded)
>>
>> OS installed on rotational disks
>>
>> N.B. Linux incorrectly detects these disks as rotational, which I
>> assume relates to weird behaviour by the PERC controller. I remembered
>> to manually correct this on the ‘client’ machine for the PWL cache,
>> but at OSD configuration time ceph would have detected them as
>> rotational. They are not rotational.
>>
>> ** Ceph Configuration
>>
>> CentOS Stream 9
>>
>> # ceph version
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>>
>> Single node installation with cephadm. 2 OSDs, one on each SSD.
>> 1 pool with size 2
>>
>> ** Client Configuration
>>
>> Fedora 38
>> librbd1-17.2.6-3.fc38.x86_64
>>
>> The PWL cache is an XFS filesystem with a 4k block size, matching the
>> underlying device. The filesystem uses the whole block device. There
>> is no other load on the system.
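>>
>> For reference, the PWL-related settings shown in the next section were
>> applied at the pool level with "rbd config pool set", roughly as
>> follows (reconstructed from the "pool"-sourced entries in the config
>> listing below, not a verbatim transcript of the commands I ran):
>>
>> # reconstructed; values match the pool-sourced entries listed below
>> rbd config pool set libvirt-pool rbd_cache_policy writeback
>> rbd config pool set libvirt-pool rbd_plugins pwl_cache
>> rbd config pool set libvirt-pool rbd_persistent_cache_mode ssd
>> rbd config pool set libvirt-pool rbd_persistent_cache_path /var/lib/libvirt/images/pwl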
>>
>> ** RBD Configuration
>>
>> # rbd config image list libvirt-pool/pwl-test | grep cache
>> rbd_cache                             true                          config
>> rbd_cache_block_writes_upfront        false                         config
>> rbd_cache_max_dirty                   25165824                      config
>> rbd_cache_max_dirty_age               1.000000                      config
>> rbd_cache_max_dirty_object            0                             config
>> rbd_cache_policy                      writeback                     pool
>> rbd_cache_size                        33554432                      config
>> rbd_cache_target_dirty                16777216                      config
>> rbd_cache_writethrough_until_flush    true                          pool
>> rbd_parent_cache_enabled              false                         config
>> rbd_persistent_cache_mode             ssd                           pool
>> rbd_persistent_cache_path             /var/lib/libvirt/images/pwl   pool
>> rbd_persistent_cache_size             1073741824                    config
>> rbd_plugins                           pwl_cache                     pool
>>
>> # rbd status libvirt-pool/pwl-test
>> Watchers:
>>         watcher=10.1.240.27:0/1406459716 client.14475 cookie=140282423200720
>> Persistent cache state:
>>         host: dell-r640-050
>>         path: /var/lib/libvirt/images/pwl/rbd-pwl.libvirt-pool.37e947fd216b.pool
>>         size: 1 GiB
>>         mode: ssd
>>         stats_timestamp: Mon Jun 26 11:29:21 2023
>>         present: true   empty: false   clean: true
>>         allocated: 180 MiB
>>         cached: 135 MiB
>>         dirty: 0 B
>>         free: 844 MiB
>>         hits_full: 1 / 0%
>>         hits_partial: 3 / 0%
>>         misses: 21952
>>         hit_bytes: 6 KiB / 0%
>>         miss_bytes: 349 MiB
>>
>> --
>> Matthew Booth

--
Matthew Booth