Re: RBD with PWL cache shows poor performance compared to cache device

On 7/4/23 10:39, Matthew Booth wrote:
On Tue, 4 Jul 2023 at 10:00, Matthew Booth <mbooth@xxxxxxxxxx> wrote:
On Mon, 3 Jul 2023 at 18:33, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Mon, Jul 3, 2023 at 6:58 PM Mark Nelson <mark.nelson@xxxxxxxxx> wrote:

On 7/3/23 04:53, Matthew Booth wrote:
On Thu, 29 Jun 2023 at 14:11, Mark Nelson <mark.nelson@xxxxxxxxx> wrote:
This container runs:
      fio --rw=write --ioengine=sync --fdatasync=1
--directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf
--output-format=json --runtime=60 --time_based=1

And extracts sync.lat_ns.percentile["99.000000"]
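(As a sketch of that extraction step, assuming fio's JSON output is saved to a file — the name etcd_perf.json here is just a placeholder — the value can be pulled out with jq:)

# Hypothetical output file name; the path mirrors fio's JSON layout for sync stats
jq '.jobs[0].sync.lat_ns.percentile."99.000000"' etcd_perf.json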
Matthew, do you have the rest of the fio output captured?  It would be interesting to see whether it's just the 99th percentile that is bad or whether the PWL cache is worse in general.
Sure.

With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, 'rbd_cache'=false:
https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/
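(For context, a sketch of how the PWL cache is typically enabled client-side for runs like these; the option names are the standard librbd ones, while the cache path and size below are placeholders:)

ceph config set client rbd_plugins pwl_cache
ceph config set client rbd_persistent_cache_mode ssd
ceph config set client rbd_persistent_cache_path /mnt/pwl   # placeholder path
ceph config set client rbd_persistent_cache_size 1G         # placeholder size
# and for the third run above: ceph config set client rbd_cache false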
Also, how's the CPU usage client side?  I would be very curious to see
if unwindpmp shows anything useful (especially lock contention):


https://github.com/markhpc/uwpmp


Just attach it to the client-side process and start out with something
like 100 samples (more are better but take longer).  You can run it like:


./unwindpmp -n 100 -p <pid>
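(For example, attaching to the QEMU process serving the RBD image — a sketch that assumes a single qemu-system-x86_64 process on the client and just redirects the call graph to a file:)

./unwindpmp -n 100 -p $(pidof qemu-system-x86_64) > uwpmp-pwl-on.txt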
I've included the output in this gist:
https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d

That gist contains 4 runs: 2 with PWL enabled and 2 without, and also
a markdown file explaining the collection method.

Matt

Thanks Matt!  I looked through the output.  Looks like the symbols might
have gotten mangled.  I'm not an expert on the RBD client, but I don't
think we would really be calling into
rbd_group_snap_rollback_with_progress from
librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl.  Is it possible
you used the libdw backend for unwindpmp?  libdw sometimes gives
strange/mangled callgraphs, but I haven't seen it before with
libunwind.  Hopefully Congmin Yin or Ilya can confirm if it's garbage.
So with that said, assuming we can trust these callgraphs at all, it
looks like it might be worth looking at the latency of the
AbstractWriteLog, librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl,
and possibly usage of librados::v14_2_0::IoCtx::object_list.  On the
Hi Mark,

Neither the rbd_group_snap_rollback_with_progress nor the
librados::v14_2_0::IoCtx::object_list entries make sense to me,
so I'd say it's garbage.
Unfortunately I'm not at all familiar with this tool. Do you know how
it obtains its symbols? I didn't install any debuginfo packages, so I
was a bit surprised to see any symbols at all.
I installed the following debuginfo packages and re-ran the tests:
elfutils-debuginfod-client-0.189-2.fc38.x86_64
elfutils-debuginfod-client-devel-0.189-2.fc38.x86_64
ceph-debuginfo-17.2.6-3.fc38.x86_64
librbd1-debuginfo-17.2.6-3.fc38.x86_64
librados2-debuginfo-17.2.6-3.fc38.x86_64
qemu-debuginfo-7.2.1-2.fc38.x86_64
qemu-system-x86-core-debuginfo-7.2.1-2.fc38.x86_64
boost-debuginfo-1.78.0-11.fc38.x86_64
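(Roughly the same set can be pulled in via dnf's debuginfo resolver; a sketch for Fedora:)

# add ceph, boost, etc. as needed for the libraries actually loaded by the client
sudo dnf debuginfo-install librbd1 librados2 qemu-system-x86-core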

Note that unwindpmp now runs considerably slower (because it re-reads
debug symbols for each sample?), so I had to reduce the number of
samples to 500.


It basically just uses libunwind or libdw to unwind the stack over and over and then unwindpmp turns the resulting samples into a forward or reverse call graph.  The libunwind backend code is here:

https://github.com/markhpc/uwpmp/blob/master/src/tracer/unwind_tracer.cc

I'm sort of amazed that it gave you symbols without the debuginfo packages installed.  I'll need to figure out a way to prevent that.  Having said that, your new traces look more accurate to me.  The thing that sticks out to me is the (slight?) amount of contention on the PWL m_lock in dispatch_deferred_writes, update_root_scheduled_ops, append_ops, append_sync_point(), etc.

I don't know if the contention around the m_lock is enough to cause an increase in 99th-percentile tail latency from 1.4ms to 5.2ms, but it's the first thing that jumps out at me.  There is a large number of threads (each tp_pwl thread, the io_context_pool threads, the qemu thread, and the bstore_aio thread) that all appear to have the potential to contend on that lock.  You could try dropping the number of tp_pwl threads from 4 to 1 and see if that changes anything.
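(One cheap way to eyeball that from the client side, a sketch assuming the same single QEMU process: watch per-thread CPU on it while fio runs; the tp_pwl, io_context_pool and bstore_aio thread names show up in the thread view:)

top -H -p $(pidof qemu-system-x86_64)
# or, with sysstat installed:
pidstat -t -p $(pidof qemu-system-x86_64) 1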


Mark



I have updated the gist with the new results:
https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d

Thanks,
Matt

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



