Re: RBD with PWL cache shows poor performance compared to cache device

"Yin, Congmin" <congmin.yin@xxxxxxxxx> · Fri, 30 Jun 2023 07:50:22 +0000

Hi Matthew,

Because of the latency added by the rbd layers, the write latency of the pwl cache is more than ten times that of the raw device.
I have replied inline, directly below the two points in question.

Best regards.
Congmin Yin

-----Original Message-----
From: Matthew Booth <mbooth@xxxxxxxxxx> 
Sent: Thursday, June 29, 2023 7:23 PM
To: Ilya Dryomov <idryomov@xxxxxxxxxx>
Cc: Giulio Fidente <gfidente@xxxxxxxxxx>; Yin, Congmin <congmin.yin@xxxxxxxxx>; Tang, Guifeng <guifeng.tang@xxxxxxxxx>; Vikhyat Umrao <vumrao@xxxxxxxxxx>; Jdurgin <Jdurgin@xxxxxxxxxx>; John Fulton <johfulto@xxxxxxxxxx>; Francesco Pantano <fpantano@xxxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re:  RBD with PWL cache shows poor performance compared to cache device

On Wed, 28 Jun 2023 at 22:44, Ilya Dryomov <idryomov@xxxxxxxxxx> wrote:
>> ** TL;DR
>>
>> In testing, the write latency performance of a PWL-cache backed RBD 
>> disk was 2 orders of magnitude worse than the disk holding the PWL 
>> cache.

The PWL cache can use either pmem or an SSD as the cache device. For pmem, I can give specific figures from my test environment at the time: the write latency of the raw pmem device is about 10+ us, the write latency of the pwl cache is about 100+ us (the extra comes from the rbd layers), and the write latency of the ceph cluster is about 1000+ us (from the messengers and the network). For SSDs there are many types, so I cannot give a specific value, but it will definitely be worse than pmem. Even so, a result that is 2 orders of magnitude worse than the cache device is worse than expected. Can you provide detailed values for all three (SSD, pwl cache, ceph cluster) for analysis?
==============================================================

>>
>> ** Summary
>>
>> I was hoping that PWL cache might be a good solution to the problem 
>> of write latency requirements of etcd when running a kubernetes 
>> control plane on ceph. Etcd is extremely write latency sensitive and 
>> becomes unstable if write latency is too high. The etcd workload can 
>> be characterised by very small (~4k) writes with a queue depth of 1.
>> Throughput, even on a busy system, is normally very low. As etcd is 
>> distributed and can safely handle the loss of un-flushed data from a 
>> single node, a local ssd PWL cache for etcd looked like an ideal 
>> solution.
>
>
> Right, this is exactly the use case that the PWL cache is supposed to address.

Good to know!

>> My expectation was that adding a PWL cache on a local SSD to an 
>> RBD-backed VM would improve write latency to something approaching the 
>> write latency performance of the local SSD. However, in my testing 
>> adding a PWL cache to an rbd-backed VM increased write latency by 
>> approximately 4x over not using a PWL cache. This was over 100x more 
>> than the write latency performance of the underlying SSD.

When using the image as the VM's disk, you may have used a command like the one below. In many cases, parameters such as cache=writeback force the rbd cache on, and the rbd cache is a memory cache. It is normal for the pwl cache to be several times slower than a memory cache. Please confirm whether this applies to your setup.
There is currently no parameter for using only the pwl cache without the rbd cache. I have tested the latency of the pwl cache alone (pmem) by modifying the code myself, and it is about twice as high as with the rbd cache.

qemu -m 1024 -drive format=raw,file=rbd:data/squeeze:rbd_cache=true,cache=writeback
==============================================================

>>
>> My expectation was based on the documentation here:
>> https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
>>
>> “The cache provides two different persistence modes. In 
>> persistent-on-write mode, the writes are completed only when they are 
>> persisted to the cache device and will be readable after a crash. In 
>> persistent-on-flush mode, the writes are completed as soon as it no 
>> longer needs the caller’s data buffer to complete the writes, but 
>> does not guarantee that writes will be readable after a crash. The 
>> data is persisted to the cache device when a flush request is received.”
>>
>> ** Method
>>
>> 2 systems, 1 running single-node Ceph Quincy (17.2.6), the other 
>> running libvirt and mounting a VM’s disk with librbd (also 17.2.6) 
>> from the first node.
>>
>> All performance testing is from the libvirt system. I tested write 
>> latency performance:
>>
>> * Inside the VM without a PWL cache
>> * Of the PWL device directly from the host (direct to filesystem, no 
>> VM)
>> * Inside the VM with a PWL cache
>>
>> I am testing with fio. Specifically I am running a containerised 
>> test, executed with:
>>    podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
>>
>> This container runs:
>>    fio --rw=write --ioengine=sync --fdatasync=1 \
>>        --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf \
>>        --output-format=json --runtime=60 --time_based=1
>>
>> And extracts sync.lat_ns.percentile["99.000000"]
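
For anyone repeating the measurement, here is a minimal Python sketch of the extraction step described above. It assumes the fio JSON report keeps its per-job results in a top-level "jobs" list, with the path sync.lat_ns.percentile["99.000000"] under each job; the function names are hypothetical, and the sample report is a stand-in built from the "rbd with pwl cache" figure in this thread, not real fio output.

```python
import json


def p99_fdatasync_ns(report: dict, job_index: int = 0) -> float:
    """Return the 99th-percentile fdatasync latency (ns) of one fio job."""
    job = report["jobs"][job_index]
    return job["sync"]["lat_ns"]["percentile"]["99.000000"]


def load_report(path: str) -> dict:
    """Load a report produced with fio --output-format=json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in for a real report: only the keys the extraction touches.
sample = {"jobs": [{"sync": {"lat_ns": {"percentile": {"99.000000": 5210112}}}}]}
print(p99_fdatasync_ns(sample))
```

With a saved report, p99_fdatasync_ns(load_report("etcd_perf.json")) would return the same field; the file name is only an example.
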
>
>
> Matthew, do you have the rest of the fio output captured?  It would be interesting to see if it's just the 99th percentile that is bad or the PWL cache is worse in general.

Sure.

With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, 'rbd_cache'=false: https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/

>> ** Results
>>
>> All results were stable across multiple runs within a small margin of error.
>>
>> * rbd no cache: 1417216 ns
>> * pwl cache device: 44288 ns
>> * rbd with pwl cache: 5210112 ns
>>
>> Note that by adding a PWL cache we increase write latency by 
>> approximately 4x, which is more than 100x that of the underlying device.
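
To make the "approximately 4x" and "more than 100x" figures easy to check, the following Python sketch recomputes the ratios from the three p99 values listed above. The constant and function names are mine, chosen for illustration; the numbers are copied from the results.

```python
# p99 fdatasync latencies reported in the results above, in nanoseconds.
RBD_NO_CACHE_NS = 1_417_216
PWL_DEVICE_NS = 44_288
RBD_WITH_PWL_NS = 5_210_112


def ratio(numerator_ns: int, denominator_ns: int) -> float:
    """How many times larger the first latency is than the second."""
    return numerator_ns / denominator_ns


pwl_vs_no_cache = ratio(RBD_WITH_PWL_NS, RBD_NO_CACHE_NS)
pwl_vs_device = ratio(RBD_WITH_PWL_NS, PWL_DEVICE_NS)
no_cache_vs_device = ratio(RBD_NO_CACHE_NS, PWL_DEVICE_NS)

print(f"rbd with pwl cache vs rbd no cache: {pwl_vs_no_cache:.1f}x")
print(f"rbd with pwl cache vs cache device: {pwl_vs_device:.1f}x")
print(f"rbd no cache vs cache device:       {no_cache_vs_device:.1f}x")
```

This gives about 3.7x, about 117.6x and 32.0x respectively, so even the uncached RBD path is only about 32x slower than the cache device.
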
>>
>> ** Hardware
>>
>> 2 x Dell R640s, each with Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM
>> Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to PERC 
>> H730P Mini (Embedded)
>>
>> OS installed on rotational disks
>>
>> N.B. Linux incorrectly detects these disks as rotational, which I 
>> assume relates to weird behaviour by the PERC controller. I 
>> remembered to manually correct this on the ‘client’ machine for the 
>> PWL cache, but at OSD configuration time ceph would have detected 
>> them as rotational. They are not rotational.
>>
>> ** Ceph Configuration
>>
>> CentOS Stream 9
>>
>>    # ceph version
>>    ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>>
>> Single node installation with cephadm. 2 OSDs, one on each SSD.
>> 1 pool with size 2
>>
>> ** Client Configuration
>>
>> Fedora 38
>> librbd1-17.2.6-3.fc38.x86_64
>>
>> PWL cache is XFS filesystem with 4k block size, matching the 
>> underlying device. The filesystem uses the whole block device. There 
>> is no other load on the system.
>>
>> ** RBD Configuration
>>
>> # rbd config image list libvirt-pool/pwl-test | grep cache
>> rbd_cache                                    true                         config
>
>
> I wonder if rbd_cache should have been set to false here to disable the default volatile cache.  Other than that, I don't see anything obviously wrong with the configuration at first sight.

I added some full output for this above.

>
> --
> Ilya
>
>> rbd_cache_block_writes_upfront               false                        config
>> rbd_cache_max_dirty                          25165824                     config
>> rbd_cache_max_dirty_age                      1.000000                     config
>> rbd_cache_max_dirty_object                   0                            config
>> rbd_cache_policy                             writeback                    pool
>> rbd_cache_size                               33554432                     config
>> rbd_cache_target_dirty                       16777216                     config
>> rbd_cache_writethrough_until_flush           true                         pool
>> rbd_parent_cache_enabled                     false                        config
>> rbd_persistent_cache_mode                    ssd                          pool
>> rbd_persistent_cache_path                    /var/lib/libvirt/images/pwl  pool
>> rbd_persistent_cache_size                    1073741824                   config
>> rbd_plugins                                  pwl_cache                    pool
>>
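
The byte-valued settings in the listing above are easier to compare once converted. A small Python sketch, with the values copied from the listing and the variable names being my own:

```python
MIB = 1024 * 1024

# Byte-valued settings copied from the rbd config listing above.
settings_bytes = {
    "rbd_cache_size": 33554432,
    "rbd_cache_max_dirty": 25165824,
    "rbd_cache_target_dirty": 16777216,
    "rbd_persistent_cache_size": 1073741824,
}

# All four values are exact multiples of 1 MiB, so integer division is lossless.
settings_mib = {name: value // MIB for name, value in settings_bytes.items()}

for name, mib in settings_mib.items():
    print(f"{name}: {mib} MiB")
```

That is a 32 MiB in-memory rbd cache (24 MiB max dirty, 16 MiB target dirty) in front of a 1024 MiB persistent cache, the latter matching the "size: 1 GiB" that rbd status reports.
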
>> # rbd status libvirt-pool/pwl-test
>> Watchers:
>>          watcher=10.1.240.27:0/1406459716 client.14475 cookie=140282423200720
>> Persistent cache state:
>>          host: dell-r640-050
>>          path: /var/lib/libvirt/images/pwl/rbd-pwl.libvirt-pool.37e947fd216b.pool
>>          size: 1 GiB
>>          mode: ssd
>>          stats_timestamp: Mon Jun 26 11:29:21 2023
>>          present: true   empty: false    clean: true
>>          allocated: 180 MiB
>>          cached: 135 MiB
>>          dirty: 0 B
>>          free: 844 MiB
>>          hits_full: 1 / 0%
>>          hits_partial: 3 / 0%
>>          misses: 21952
>>          hit_bytes: 6 KiB / 0%
>>          miss_bytes: 349 MiB

--
Matthew Booth
