On Thu, 11 Nov 2021 at 13:54, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

> I'm still trying to combat really bad read performance from HDD-backed
> replicated pools, which is under 100 MB/s most of the time with 1 thread
> and QD=1. I don't quite understand why the reads are that slow, i.e. much

(doing a single-thread, single-client test on a cluster ..)

> slower than a single HDD, but do understand that Ceph clients read a PG
> from primary OSD only.
>
> Since reads are immutable, is it possible to make Ceph clients read PG in a
> RAID1-like fashion, i.e. if a PG has a primary OSD and two replicas, is it
> possible to read all 3 OSDs in parallel for a 3x performance gain?

This could possibly work out fine for the above-mentioned test, but that
test and the proposed solution are somewhat orthogonal to the problem Ceph
tries to solve, which is to serve IO to hundreds of clients (or programs,
or threads, or guests, or ..) using tens or hundreds of servers, while
being able to scale to petabyte clusters without issue. In that case you
will probably not see a huge performance increase by sending 3x as many
operations, because it means every drive now has 3x the work to do, or the
queue of in-flight ops becomes 3x longer.

Also, if you map an RBD, like when you have the rbd image as a 40G qemu
drive for a VM guest, it will get split into 4M pieces anyhow, so if the
guest decides to read its drive from 0 -> end it will fire off 10000 read
requests, spread out over the cluster and all OSDs the pool is placed on,
so you get load sharing. If the read is written in the worst possible way,
like "read one meg, wait until you have it, then ask for the next meg,
then wait for that to be delivered ..." then yes, you will get the worst
experience.

Also, 100 MB/s from a spinning drive over a network isn't all that bad
given QD=1, since the round-trip network latency is added to every block
you ask for in a linear read. If the turn-around time over the network is
1 ms, then 1000 x IO-size is all you could hope for optimally at QD=1.

If the client instead sent off up to 10k requests in parallel (one thread
reads 0 -> 3.99M, the next reads 4 -> 7.99M, the next 8 -> 11.99M, and so
on), it might well show you a bump in MB/s on a quiet cluster. But that is
also a very non-normal IO pattern for cluster users in my view.

So while it is easy to visualize a huge improvement by asking for more IO
from what you imagine are idle drives, the normal cluster will be spreading
tons and tons of IO all kinds of ways all the time, so making the server IO
queue deeper is probably not going to improve the sum of all IO that goes
to the clients. (Some back-of-the-envelope numbers and a small
parallel-read sketch follow below the signature.)

-- 
May the most significant bit of your life be positive.
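
To put rough numbers on the 4M splitting above, a small Python sketch. It
assumes the default 4 MiB RBD object size and no custom striping; the 40G
figure is just the qemu-drive example from the mail:

    # back-of-the-envelope for the object split; 40G image, default 4 MiB
    # RBD object size, no custom striping assumed
    image_size = 40 * 1024**3          # 40 GiB guest disk
    object_size = 4 * 1024**2          # default RBD object size

    print("objects for a full pass:", image_size // object_size)   # 10240

    # which RADOS object a given byte offset of the image lands in
    offset = 12_345_678_901
    print("offset", offset, "-> object #", offset // object_size)  # 2943

Since CRUSH spreads those objects over all the OSDs backing the pool, a
full linear pass ends up touching the whole cluster.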
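
The QD=1 ceiling, again only a sketch: the 1 ms turn-around and the
180 MB/s HDD streaming rate below are assumed numbers, not measurements
from Zakhar's cluster:

    rtt = 0.001          # assumed client <-> OSD turn-around per request, s
    hdd_rate = 180e6     # assumed streaming rate of one HDD, bytes/s

    def qd1_ceiling(io_size):
        # one request at a time: pay the round trip plus the transfer, repeat
        return io_size / (rtt + io_size / hdd_rate)

    print("4 MiB requests:   %3.0f MB/s" % (qd1_ceiling(4 * 1024**2) / 1e6))  # ~173
    print("128 KiB requests: %3.0f MB/s" % (qd1_ceiling(128 * 1024) / 1e6))   # ~76

    # ignoring the disk entirely, the best case is simply io_size / rtt,
    # i.e. the "1000 x IO-size" mentioned above

With smaller request sizes or a bit more latency you land right around the
100 MB/s that was observed, before any seek time is even counted.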
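
And for completeness, a minimal sketch of the "many ranges in flight"
pattern using the python-rados/python-rbd bindings. The pool name "rbd",
the image name "testimg", the 4 MiB chunk size and the 16 outstanding
ranges are all assumptions for illustration, error handling is left out,
and it is meant to show the IO pattern, not to be a benchmark tool:

    from concurrent.futures import ThreadPoolExecutor

    import rados
    import rbd

    CHUNK = 4 * 1024 * 1024        # one read request per 4 MiB range
    IN_FLIGHT = 16                 # how many ranges to keep outstanding

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    def read_range(offset, length):
        # one ioctx/image handle per request is wasteful but keeps the
        # threads independent and the sketch short
        ioctx = cluster.open_ioctx('rbd')
        image = rbd.Image(ioctx, 'testimg')
        try:
            return image.read(offset, length)
        finally:
            image.close()
            ioctx.close()

    # find out how big the image is
    ioctx = cluster.open_ioctx('rbd')
    image = rbd.Image(ioctx, 'testimg')
    size = image.size()
    image.close()
    ioctx.close()

    # fire off reads for many ranges at once instead of one after the other
    with ThreadPoolExecutor(max_workers=IN_FLIGHT) as pool:
        futures = [pool.submit(read_range, off, min(CHUNK, size - off))
                   for off in range(0, size, CHUNK)]
        total = sum(len(f.result()) for f in futures)

    print("read %d bytes with up to %d ranges in flight" % (total, IN_FLIGHT))
    cluster.shutdown()

Whether something like that actually helps depends on how busy the rest of
the cluster already is, which is the point made above.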