Re: RBD persistent writeback cache crash (was: performance)

Wido den Hollander <wido@xxxxxxxx> · Wed, 9 Jun 2021 13:45:40 +0200

On 08/06/2021 21:04, Ilya Dryomov wrote:
On Tue, Jun 8, 2021 at 7:11 PM Wido den Hollander <wido@xxxxxxxx> wrote:

Hi,

So I've been doing some tests with v16.2.4 with a 2TB Samsung PM983 SSD
mounted under /mnt/rbd-cache

rbd_persistent_cache_mode = ssd
rbd_persistent_cache_size = 2G
rbd_persistent_cache_path = /mnt/rbd-cache
rbd_plugins = pwl_cache

I tried both XFS and EXT4 as the filesystem.

This, however, causes fio or 'rbd bench' to crash:

root@infra-138-b16-27:~# fio fio/rbd_rw_1.fio
rbd_w_iodepth_1: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
fio-3.1
Starting 1 process
Segmentation fault1)][13.3%][r=0KiB/s,w=14.7MiB/s][r=0,w=3768 IOPS][eta 00m:52s]
root@infra-138-b16-27:~#

(The IOps seem great!)

My fio test is fairly simple:

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio1
invalidate=0
bs=4k
runtime=60
direct=1

[rbd_w_iodepth_1]
rw=randwrite
iodepth=1

I have tried to trace it with gdb, but I didn't get any further than
this backtrace:

(gdb) bt
#0  ContextWQ::process (ctx=0x7fffb8081480, this=0x7fffb8012470) at ./src/common/WorkQueue.h:556
#1  ThreadPool::PointerWQ<Context>::_void_process (this=0x7fffb8012470, item=0x7fffb8081480, handle=...) at ./src/common/WorkQueue.h:341
#2  0x00007fffec600912 in ThreadPool::worker (this=0x7fffb8012018, wt=<optimized out>) at ./src/common/WorkQueue.cc:117
#3  0x00007fffec601801 in ThreadPool::WorkThread::entry (this=<optimized out>) at ./src/common/WorkQueue.h:395
#4  0x00007ffff5c796db in start_thread (arg=0x7fffb17fa700) at pthread_create.c:463
#5  0x00007ffff579e71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Has anybody been able to use pwl_cache successfully?

Hi Wido,

Unfortunately, the "rbd_persistent_cache_mode = ssd" cache has shipped
rather broken.  This particular crash is most likely already fixed
in master, but there are a few more outstanding.  There are a dozen
"[pwl ssd] ..." tickets in the rbd project; the fixes would be
backported to pacific once the ssd mode is stable enough.

Until then, I would stick to "rbd_persistent_cache_mode = rwl"
or avoid the pwl_cache plugin entirely.
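
For reference, the rwl variant of the settings quoted earlier in this
thread would look roughly like the sketch below. It reuses the cache path
and size from the original report. It also assumes the cache path sits on
a DAX-capable filesystem, which is an assumption here, not something
stated in this message.

```
# Sketch only: same options as in the original report, with the mode
# switched from ssd to rwl. Path and size are carried over unchanged.
rbd_plugins = pwl_cache
rbd_persistent_cache_mode = rwl
rbd_persistent_cache_path = /mnt/rbd-cache
rbd_persistent_cache_size = 2G
```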

I tried with rwl and am now using XFS with DAX enabled and it works.

Performance-wise, I see a 2x improvement in IOps with qd=1 and bs=4k.

My kernel reports that the 4TB PM983 NVMe I am using as a backing
device is at 100% utilization, but that seems off.

Can we expect any fixes for this cache in .5 or .6?

Wido

Thanks,

                 Ilya
