I'm setting up a new Ceph cluster with fast SSD drives, and there is one problem I want to make sure to address straight away: comically low OSD queue depths.

On the past several clusters I built, there was one major performance problem that I never had time to really solve: regardless of how much work the RBDs were being asked to do, the OSD effective queue depth (as measured by iostat's "avgqu-sz" column) never went above 3... even when I had multiple RBDs with queue depths in the thousands.

This made sense back in the old days of spinning drives. With these particular SSDs, however, at a 4K or 16K block size you don't see maximum read performance until the queue depth gets to 50 or more, and at a queue depth of 4 the bandwidth is less than 20% of what it is at 256. The bottom line is that Ceph performance is simply embarrassing whenever the OSD effective queue depth is in the single digits.

On my last cluster, I spent a week or two researching and trying OSD config parameters to increase the queue depth. So far, the only effective method I have found is a gross hack: using multiple partitions per SSD to create multiple OSDs. (Sketches of both the config knobs I mean and the partition hack are in the postscript below.)

My questions:

1) Has anyone on this list solved this problem already? The performance articles I have seen don't show iostat output (or any OSD effective queue depth numbers), so I can't tell.

2) If there isn't a good answer to #1, is anyone else out there able to do some experimentation to help figure this out? All you would need to do to get started is collect the output of "iostat -mtxy 1" while a high-queue-depth RBD test is running. Collect it on all of the OSD servers as well as the client; you will want to map an RBD and talk to it via /dev/rbd0, otherwise iostat probably won't see the traffic. (The exact commands I have in mind are sketched in the postscript.)

3) If there is any technical reason why this is impossible, please let me know before I get too far down this road... but since the multiple-partitions trick works so well, I expect it must be possible somehow.

Thanks,
Mark
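
P.S. A few concrete sketches for anyone who wants to reproduce or experiment with this. First, to check the queue-depth curve on your own drives, something like the fio loop below is what I mean. The device name is a placeholder, and randread is non-destructive, but double-check the target before running it:

    # Raw-device 4K random reads at increasing queue depths.
    DEV=/dev/nvme0n1            # placeholder: point this at your own SSD
    for QD in 1 4 16 64 256; do
        fio --name=raw-qd-$QD --filename=$DEV --readonly \
            --direct=1 --rw=randread --bs=4k --ioengine=libaio \
            --iodepth=$QD --runtime=30 --time_based \
            --output=raw-qd-$QD.log
    done

Comparing the bandwidth across those runs should show the same kind of curve I described above.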
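
Second, for anyone who wants to experiment with config parameters rather than repeat my dead ends, these are examples of the kind of OSD-side parallelism knobs that exist in ceph.conf. The values are illustrative only, not a recommendation, and defaults differ between releases:

    [osd]
    # Examples of OSD-side parallelism settings; not a recommendation.
    osd op num shards = 10
    osd op num threads per shard = 4
    filestore op threads = 8
    filestore queue max ops = 500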
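
Third, the multiple-partitions hack. It is nothing clever; roughly, per SSD (device name and partition sizes are placeholders, and the ceph-volume syntax is from a newer release, older ones used ceph-disk, so check it against your version):

    # WARNING: destroys everything on the device. Example only.
    DEV=/dev/sdb
    sgdisk --zap-all $DEV
    for i in 1 2 3 4; do
        sgdisk --new=$i:0:+200G $DEV              # four partitions per SSD
    done
    for i in 1 2 3 4; do
        ceph-volume lvm create --data ${DEV}$i    # one OSD per partition
    done

It works, but each extra OSD costs memory and adds recovery overhead, which is why I call it a gross hack.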
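
Finally, for question 2, this is the end-to-end procedure I have in mind. The pool and image names are just examples, and the fio parameters are only one way to generate a high queue depth:

    # Client: create and map a test image so the I/O shows up as /dev/rbd0.
    rbd create testpool/qdtest --size 102400      # 100 GB; names are examples
    rbd map testpool/qdtest

    # Client AND every OSD server: capture iostat for the whole run.
    iostat -mtxy 1 > iostat-$(hostname).log &

    # Client: push a high queue depth at the mapped device. Writes are used
    # because reads from a freshly created image mostly hit unallocated
    # extents and may never reach the OSD disks.
    fio --name=rbd-qd --filename=/dev/rbd0 --direct=1 --rw=randwrite \
        --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --time_based

    # Afterwards, stop the background iostat and gather the logs.
    kill %1

The column to watch in the OSD-side logs is avgqu-sz on the data devices.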