On Thu, 11 Nov 2021 at 13:54, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

> I'm still trying to combat really bad read performance from HDD-backed
> replicated pools, which is under 100 MB/s most of the time with 1 thread
> and QD=1. I don't quite understand why the reads are that slow, i.e. much

(doing a single-thread, single-client test on a cluster ..)

> slower than a single HDD, but do understand that Ceph clients read a PG
> from primary OSD only.
>
> Since reads are immutable, is it possible to make Ceph clients read PG in a
> RAID1-like fashion, i.e. if a PG has a primary OSD and two replicas, is it
> possible to read all 3 OSDs in parallel for a 3x performance gain?

This could possibly work out fine for the above-mentioned test, but that
test and the proposed solution are somewhat orthogonal to the problem Ceph
tries to solve, which is to serve IO to hundreds of clients (or programs,
or threads, or guests, or ..) using tens or hundreds of servers, while
being able to scale to petabyte clusters without issue. In that case you
will probably not see a huge performance increase by sending 3x as many
operations, because it means every drive now has 3x the work to do, or the
queue of in-flight ops becomes 3x longer.

Also, if you map an RBD, like when you have the rbd image as a 40G qemu
drive for a VM guest, it will get split into 4M pieces anyhow, so if the
guest decides to read its drive from 0 -> end it will fire off 10000 read
requests, spread out over the cluster and all OSDs the pool is placed on,
so you get load sharing. If the read is written in the worst possible way,
like "read one meg, wait until you have it, then ask for the next meg,
then wait for that to be delivered ..." then yes, you will get the worst
experience.

Also, 100 MB/s from a spinning drive over a network isn't all that bad
given QD=1, since the round-trip network latency is added to every block
you ask for in a linear read. If the turn-around time over the network is
1 ms, then 1000 x IO-size is all you could hope for optimally at QD=1.

If the client instead sent off up to 10k requests in parallel (one thread
reads 0 -> 3.99M, the next reads 4 -> 7.99M, the next 8 -> 11.99M, and so
on), it might well show you a bump in MB/s on a quiet cluster. But that is
also a very non-normal IO pattern for cluster users in my view.

So while it is easy to visualize a huge improvement by asking for more IO
from what you imagine are idle drives, the normal cluster will be spreading
tons and tons of IO all kinds of ways all the time, so making the server IO
queue deeper is probably not going to improve the sum of all IO that goes
to the clients. (Some back-of-the-envelope numbers and a small
parallel-read sketch follow below the signature.)

-- 
May the most significant bit of your life be positive.
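
To put rough numbers on the 4M splitting above, a small Python sketch. It
assumes the default 4 MiB RBD object size and no custom striping; the 40G
figure is just the qemu-drive example from the mail:

    # back-of-the-envelope for the object split; 40G image, default 4 MiB
    # RBD object size, no custom striping assumed
    image_size = 40 * 1024**3          # 40 GiB guest disk
    object_size = 4 * 1024**2          # default RBD object size

    print("objects for a full pass:", image_size // object_size)   # 10240

    # which RADOS object a given byte offset of the image lands in
    offset = 12_345_678_901
    print("offset", offset, "-> object #", offset // object_size)  # 2943

Since CRUSH spreads those objects over all the OSDs backing the pool, a
full linear pass ends up touching the whole cluster.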
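
The QD=1 ceiling, again only a sketch: the 1 ms turn-around and the
180 MB/s HDD streaming rate below are assumed numbers, not measurements
from Zakhar's cluster:

    rtt = 0.001          # assumed client <-> OSD turn-around per request, s
    hdd_rate = 180e6     # assumed streaming rate of one HDD, bytes/s

    def qd1_ceiling(io_size):
        # one request at a time: pay the round trip plus the transfer, repeat
        return io_size / (rtt + io_size / hdd_rate)

    print("4 MiB requests:   %3.0f MB/s" % (qd1_ceiling(4 * 1024**2) / 1e6))  # ~173
    print("128 KiB requests: %3.0f MB/s" % (qd1_ceiling(128 * 1024) / 1e6))   # ~76

    # ignoring the disk entirely, the best case is simply io_size / rtt,
    # i.e. the "1000 x IO-size" mentioned above

With smaller request sizes or a bit more latency you land right around the
100 MB/s that was observed, before any seek time is even counted.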
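
And for completeness, a minimal sketch of the "many ranges in flight"
pattern using the python-rados/python-rbd bindings. The pool name "rbd",
the image name "testimg", the 4 MiB chunk size and the 16 outstanding
ranges are all assumptions for illustration, error handling is left out,
and it is meant to show the IO pattern, not to be a benchmark tool:

    from concurrent.futures import ThreadPoolExecutor

    import rados
    import rbd

    CHUNK = 4 * 1024 * 1024        # one read request per 4 MiB range
    IN_FLIGHT = 16                 # how many ranges to keep outstanding

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    def read_range(offset, length):
        # one ioctx/image handle per request is wasteful but keeps the
        # threads independent and the sketch short
        ioctx = cluster.open_ioctx('rbd')
        image = rbd.Image(ioctx, 'testimg')
        try:
            return image.read(offset, length)
        finally:
            image.close()
            ioctx.close()

    # find out how big the image is
    ioctx = cluster.open_ioctx('rbd')
    image = rbd.Image(ioctx, 'testimg')
    size = image.size()
    image.close()
    ioctx.close()

    # fire off reads for many ranges at once instead of one after the other
    with ThreadPoolExecutor(max_workers=IN_FLIGHT) as pool:
        futures = [pool.submit(read_range, off, min(CHUNK, size - off))
                   for off in range(0, size, CHUNK)]
        total = sum(len(f.result()) for f in futures)

    print("read %d bytes with up to %d ranges in flight" % (total, IN_FLIGHT))
    cluster.shutdown()

Whether something like that actually helps depends on how busy the rest of
the cluster already is, which is the point made above.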