On 7/3/23 04:53, Matthew Booth wrote:
On Thu, 29 Jun 2023 at 14:11, Mark Nelson <mark.nelson@xxxxxxxxx> wrote:
This container runs:
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf \
    --output-format=json --runtime=60 --time_based=1
And extracts sync.lat_ns.percentile["99.000000"]
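(Aside: since the job uses --output-format=json, that value can be pulled out with jq once the report is saved to a file; the file name etcd_perf.json and the single-job index are assumptions here:

  jq '.jobs[0].sync.lat_ns.percentile["99.000000"]' etcd_perf.json)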
Matthew, do you have the rest of the fio output captured? It would be interesting to see whether it's just the 99th percentile that's bad or whether the PWL cache is worse in general.
Sure.
With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, 'rbd_cache'=false:
https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/
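(For anyone reproducing this, the client-side settings involved look roughly like the sketch below. Option names are from the RBD persistent write-back cache docs; the path and size values are placeholders:

  [client]
  rbd_cache = true                       # set to false for the third run
  rbd_plugins = pwl_cache
  rbd_persistent_cache_mode = ssd
  rbd_persistent_cache_path = /mnt/pwl   # placeholder
  rbd_persistent_cache_size = 1G)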
Also, how's the CPU usage on the client side? I would be very curious to see
whether unwindpmp shows anything useful (especially lock contention):
https://github.com/markhpc/uwpmp
Just attach it to the client-side process and start out with something
like 100 samples (more samples give better results but take longer). You can run it like:
./unwindpmp -n 100 -p <pid>
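(Roughly, assuming the client is a QEMU process, something like the following grabs the pid and saves a capture; the pgrep pattern and output file name are just examples, and root is typically needed to attach:

  pid=$(pgrep -f qemu-kvm | head -n1)
  sudo ./unwindpmp -n 100 -p "$pid" > unwindpmp_n100_$pid.txt)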
I've included the output in this gist:
https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d
That gist contains 4 runs: 2 with PWL enabled and 2 without, and also
a markdown file explaining the collection method.
Matt
Thanks Matt! I looked through the output. Looks like the symbols might
have gotten mangled. I'm not an expert on the RBD client, but I don't
think we would really be calling into
rbd_group_snap_rollback_with_progress from
librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl. Is it possible
you used the libdw backend for unwindpmp? libdw sometimes gives
strange/mangled callgraphs, but I haven't seen that before with
libunwind. Hopefully Congmin Yin or Ilya can confirm whether it's garbage.
So with that said, assuming we can trust these callgraphs at all, it
looks like it might be worth looking at the latency of the
AbstractWriteLog, librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl,
and possibly usage of librados::v14_2_0::IoCtx::object_list. On the
QEMU side, possibly the latency of rbd_aio_flush in both cases. It's
also possible we have md_config_t get_val/set_val in the hot path
somewhere, though that looks minor. If the
rbd_group_snap_rollback_with_progress usage is real, it's significantly
more prevalent in the PWL callgraphs. Without knowing more about how
the PWL cache works, though, I'm not sure whether any of this is
meaningful.
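(As a rough sanity check on that last point, the frame counts can be compared across the captures with grep; the file names below are placeholders for the four runs in the gist:

  grep -c rbd_group_snap_rollback_with_progress pwl_run*.txt nopwl_run*.txt)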
Mark
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx