On 7/10/23 11:19 AM, Matthew Booth wrote:
On Thu, 6 Jul 2023 at 12:54, Mark Nelson <mark.nelson@xxxxxxxxx> wrote:
On 7/6/23 06:02, Matthew Booth wrote:
On Wed, 5 Jul 2023 at 15:18, Mark Nelson <mark.nelson@xxxxxxxxx> wrote:
I'm sort of amazed that it gave you symbols without the debuginfo
packages installed. I'll need to figure out a way to prevent that.
Having said that, your new traces look more accurate to me. The thing
that sticks out to me is the (slight?) amount of contention on the PWL
m_lock in dispatch_deferred_writes(), update_root_scheduled_ops(),
append_ops(), append_sync_point(), etc.
I don't know if the contention around the m_lock is enough to cause an
increase in 99% tail latency from 1.4ms to 5.2ms, but it's the first
thing that jumps out at me. There appear to be a large number of
threads (each tp_pwl thread, the io_context_pool threads, the qemu
thread, and the bstore_aio thread) with the potential to contend on
that lock. You could try dropping the number of tp_pwl threads from 4
to 1 and see if that changes anything.
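
To make that suspicion a bit more concrete, here is a toy, standalone
illustration (nothing Ceph-specific; the thread counts, iteration counts
and critical-section size are all made-up assumptions) of how a few
extra threads hammering a single mutex can inflate the tail latency of
an otherwise tiny critical section:

// Toy illustration only, not Ceph code: N threads contend on one mutex
// (a stand-in for the PWL m_lock) and we record how long each
// lock/unlock of a tiny critical section takes.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  for (int nthreads : {1, 4, 8}) {
    std::mutex m;                      // stand-in for the contended lock
    std::mutex samples_lock;
    std::vector<long long> samples;    // nanoseconds per critical section
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
      workers.emplace_back([&] {
        std::vector<long long> local;
        for (int i = 0; i < 20000; ++i) {
          auto start = std::chrono::steady_clock::now();
          {
            std::lock_guard<std::mutex> g(m);
            volatile int x = 0;        // a tiny amount of "bookkeeping"
            for (int j = 0; j < 100; ++j) x = x + j;
          }
          auto end = std::chrono::steady_clock::now();
          local.push_back(std::chrono::duration_cast<
              std::chrono::nanoseconds>(end - start).count());
        }
        std::lock_guard<std::mutex> g(samples_lock);
        samples.insert(samples.end(), local.begin(), local.end());
      });
    }
    for (auto& w : workers) w.join();
    std::sort(samples.begin(), samples.end());
    std::printf("%d threads: p50=%lld ns  p99=%lld ns\n", nthreads,
                samples[samples.size() / 2],
                samples[samples.size() * 99 / 100]);
  }
  return 0;
}

The absolute numbers will vary wildly from machine to machine; the point
is only to show the kind of effect I'm worried about, where occasional
waits behind other lock holders show up mostly in the tail.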
Will do. Any idea how to do that? I don't see an obvious rbd config option.
Thanks for looking into this,
Matt
You thanked me too soon... it appears to be hard-coded, so you'll have
to do a custom build. :D
https://github.com/ceph/ceph/blob/main/src/librbd/cache/pwl/AbstractWriteLog.cc#L55-L56
Just to update: I have managed to test this today and it made no difference :(
Sorry for the late reply; I just saw that I had written this email but
never actually sent it.
So... Nuts. I was hoping for at least a little gain if you dropped it to 1.
In general, though, unless it's something egregious, are we really
looking for something CPU-bound? Writes are 2 orders of magnitude
slower than the underlying local disk, so this has to be caused by
something wildly inefficient.
In this case I would expect it to be entirely latency-bound. It didn't
look like PWL was working particularly hard, but to the extent that it
was doing anything, it looked like it was spending a surprising amount
of time dealing with that lock. I still suspect that if your goal is to
reduce the 99% tail latency, you'll need to figure out what's causing
the little micro-stalls.
I have had a thought: the guest filesystem has 512-byte blocks, but
the pwl filesystem has 4k blocks (on a 4k disk). Given that the test
is of small writes, is there any chance that we're multiplying the
number of physical writes in some pathological manner?
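
As a back-of-envelope sketch of what I mean (assuming, pessimistically,
that every 512-byte write ends up rewriting its whole 4k block, which is
exactly the behaviour I'm unsure about):

// Back-of-envelope arithmetic only. Compares the best case, where eight
// consecutive 512-byte writes coalesce into one aligned 4k write, with a
// pathological case where each small write triggers a read-modify-write
// of its whole 4k block. Whether the pwl filesystem actually behaves
// this way is the open question.
#include <cstdio>

int main() {
  const unsigned guest_block = 512;   // guest filesystem block size
  const unsigned pwl_block = 4096;    // pwl filesystem block / physical sector
  const unsigned writes = 8;          // eight consecutive 512-byte guest writes

  const unsigned logical = writes * guest_block;

  // Best case: contiguous small writes merge into a single aligned 4k write.
  const unsigned merged = ((logical + pwl_block - 1) / pwl_block) * pwl_block;

  // Pathological case: each small write rewrites a full 4k block on its own.
  const unsigned rmw = writes * pwl_block;

  std::printf("logical bytes written:  %u\n", logical);
  std::printf("best case physical:     %u (%ux)\n", merged, merged / logical);
  std::printf("pathological physical:  %u (%ux)\n", rmw, rmw / logical);
  return 0;
}

If the pathological case is anywhere close to what's happening, these
small writes would be multiplied by up to 8x before they ever reach the
cache device.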
Matt
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx