Re: How to maximize the OSD effective queue depth in Ceph?

On 10/05/2019 19:54, Mark Lehrer wrote:
I'm setting up a new Ceph cluster with fast SSD drives, and there is
one problem I want to make sure to address straight away:
comically-low OSD queue depths.

On the past several clusters I built, there was one major performance
problem that I never had time to really solve, which is this:
regardless of how much work the RBDs were being asked to do, the OSD
effective queue depth (as measured by iostat's "avgrq-sz" column)
never went above 3... even if I had multiple RBDs with queue depths in
the thousands.

This made sense back in the old days of spinning drives.  However, with
these particular drives, for example, you don't see maximum read
performance at a 4K or 16K block size until the queue depth reaches 50+.
At a queue depth of 4, the bandwidth is less than 20% of what it is at
256.  The bottom line here is that Ceph performance is simply
embarrassing whenever the OSD effective queue depth is in single
digits.
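
For reference, this is the kind of queue-depth sweep I am talking about
(the device name and fio parameters below are just an illustration --
adjust them for your own hardware; random reads against the raw device
are non-destructive):

    for qd in 1 4 16 64 256; do
      fio --name=qd$qd --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
          --rw=randread --bs=4k --iodepth=$qd --runtime=30 --time_based
    done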

On my last cluster, I spent a week or two researching and tuning OSD
config parameters to try to increase the queue depth.  So far, the
only effective method I have seen to increase the effective OSD queue
depth is a gross hack: using multiple partitions per SSD to create
multiple OSDs.
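
(For anyone who wants to reproduce the multi-OSD hack: recent
ceph-volume releases can do it without manual partitioning, along these
lines -- the device name is just an example:

    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1

To be clear, I don't consider this a fix; it just works around whatever
the real bottleneck is.)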

My questions:

1) Is there anyone on this list who has solved this problem already?
In the performance articles I have seen, the authors don't show iostat
results (or any OSD effective queue depth numbers), so I can't really
tell.

2) If there isn't a good response to #1, is anyone else out there able
to do some experimentation to help figure this out?  All you would
need to do to get started is collect the output of this command while
a high-QD RBD test is happening: "iostat -mtxy 1" -- you should
collect it on all of the OSD servers as well as the client (you will
want to attach an RBD and talk to it via /dev/rbd0, otherwise iostat
probably won't see it).
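
Something like the following is what I have in mind -- pool and image
names are placeholders, and the fio parameters are only an example:

    rbd create testpool/qd-test --size 100G
    rbd map testpool/qd-test        # typically shows up as /dev/rbd0
    fio --name=qd-test --filename=/dev/rbd0 --direct=1 --ioengine=libaio \
        --rw=randread --bs=4k --iodepth=256 --numjobs=4 --runtime=300 --time_based
    # on the client and on every OSD server, in parallel:
    iostat -mtxy 1 | tee iostat-$(hostname).log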

3) If there is any technical reason why this is impossible, please let
me know before I get too far down this road... but because the multiple
partitions trick works so well, I expect it must be possible somehow.

Thanks,
Mark

I assume you mean avgqu-sz (average queue size) rather than avgrq-sz (average request size). If so, what avgrq-sz do you get? What kernel and I/O scheduler are you using?
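
For example (sdX is a placeholder, substitute one of your OSD data devices):

    uname -r
    cat /sys/block/sdX/queue/scheduler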

It is not uncommon, if the system is not well tuned for your workload, to have a bottleneck such as a CPU running near 100% while your disks sit at single-digit % busy. The faster your disks are and the more of them you have, the less busy they will be when there is a CPU or network bottleneck, and in that case the queue depth on them will be very low.

It is also possible that the cluster itself performs well but the bottleneck is on the client(s) running the test, which are not fast enough to fully stress your cluster, and hence your disks.

To know more, we need more numbers:
-How many SSDs/OSDs do you have, and what is their raw device random 4k sync write iops? (see the example fio command after this list)
-How many hosts and cpu cores do you have?
-How many nics, and what is their speed?
-What total iops do you get? What params did you use for the 4k test? Is it random or sequential?
-Do you use enough threads/queue depth to stress all your OSDs in parallel?
-Run atop during the test: what cpu and disk % busy do you see on all hosts, including clients?
-How many clients do you use? For a fast cluster you may need many clients to stress it; keep increasing clients until your numbers saturate.
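
For the raw device sync write test mentioned in the first point, something like the following is typical (sdX is a placeholder; note this writes directly to the device and will destroy any data on it, so only run it against an unused disk):

    fio --name=syncwrite --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based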

/Maged
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


