On Wed, Jun 9, 2021 at 1:45 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
>
> On 08/06/2021 21:04, Ilya Dryomov wrote:
> > On Tue, Jun 8, 2021 at 7:11 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> So I've been doing some tests with v16.2.4 with a 2TB Samsung PM983 SSD
> >> mounted under /mnt/rbd-cache:
> >>
> >> rbd_persistent_cache_mode = ssd
> >> rbd_persistent_cache_size = 2G
> >> rbd_persistent_cache_path = /mnt/rbd-cache
> >> rbd_plugins = pwl_cache
> >>
> >> I tried both XFS and ext4 as the filesystem.
> >>
> >> This, however, causes fio or 'rbd bench' to crash:
> >>
> >> root@infra-138-b16-27:~# fio fio/rbd_rw_1.fio
> >> rbd_w_iodepth_1: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W)
> >> 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
> >> fio-3.1
> >> Starting 1 process
> >> Segmentation fault1)][13.3%][r=0KiB/s,w=14.7MiB/s][r=0,w=3768 IOPS][eta
> >> 00m:52s]
> >> root@infra-138-b16-27:~#
> >>
> >> (The IOPS seem great!)
> >>
> >> My fio test is fairly simple:
> >>
> >> [global]
> >> ioengine=rbd
> >> clientname=admin
> >> pool=rbd
> >> rbdname=fio1
> >> invalidate=0
> >> bs=4k
> >> runtime=60
> >> direct=1
> >>
> >> [rbd_w_iodepth_1]
> >> rw=randwrite
> >> iodepth=1
> >>
> >> I have tried to trace it with gdb, but I didn't get further than this
> >> backtrace:
> >>
> >> (gdb) bt
> >> #0  ContextWQ::process (ctx=0x7fffb8081480, this=0x7fffb8012470) at
> >> ./src/common/WorkQueue.h:556
> >> #1  ThreadPool::PointerWQ<Context>::_void_process (this=0x7fffb8012470,
> >> item=0x7fffb8081480, handle=...) at ./src/common/WorkQueue.h:341
> >> #2  0x00007fffec600912 in ThreadPool::worker (this=0x7fffb8012018,
> >> wt=<optimized out>) at ./src/common/WorkQueue.cc:117
> >> #3  0x00007fffec601801 in ThreadPool::WorkThread::entry (this=<optimized
> >> out>) at ./src/common/WorkQueue.h:395
> >> #4  0x00007ffff5c796db in start_thread (arg=0x7fffb17fa700) at
> >> pthread_create.c:463
> >> #5  0x00007ffff579e71f in clone () at
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> >>
> >> Has anybody been able to use pwl_cache successfully?
> >
> > Hi Wido,
> >
> > Unfortunately the "rbd_persistent_cache_mode = ssd" cache has shipped
> > rather broken. This particular crash is most likely already fixed
> > in master, but there are a few more outstanding. There are a dozen
> > "[pwl ssd] ..." tickets in the rbd project; the fixes will be
> > backported to Pacific once the ssd mode is stable enough.
> >
> > Until then, I would stick to "rbd_persistent_cache_mode = rwl"
> > or avoid the pwl_cache plugin entirely.
>
> I tried with rwl and am now using XFS with DAX enabled, and it works.
>
> Performance-wise I see a 2x improvement in IOPS with qd=1, bs=4k.
>
> My kernel reports that the PM983 4TB NVMe I am using as a backing
> device is at 100% utilization, but that seems off.
>
> Can we expect any fixes for this cache in .5 or .6?

Some may land in .5, but .6 would probably be the earliest for the ssd
mode to actually be usable.

Thanks,

                Ilya
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
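
[Editor's note] For readers who want to reproduce the working "rwl" setup Wido describes, a minimal sketch follows. The device path /dev/pmem0 is an assumption (rwl mode needs a DAX-capable persistent-memory device; it is not stated in the thread which device Wido used), and the cache size/path are taken from the config shown above:

```shell
# ASSUMPTION: /dev/pmem0 is your DAX-capable (pmem) device -- adjust as needed.
# Disable reflink: XFS DAX is incompatible with reflink on older kernels.
mkfs.xfs -m reflink=0 /dev/pmem0
mkdir -p /mnt/rbd-cache
mount -o dax /dev/pmem0 /mnt/rbd-cache

# Client-side librbd settings, e.g. in the [client] section of ceph.conf
# (option names as used in the thread):
#   rbd_plugins = pwl_cache
#   rbd_persistent_cache_mode = rwl
#   rbd_persistent_cache_size = 2G
#   rbd_persistent_cache_path = /mnt/rbd-cache
```

Verify the mount actually has DAX active (`mount | grep rbd-cache` should show the `dax` option) before running the fio job, since the rwl cache falls back or fails without it.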