Re: RBD poor performance

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Thu, 28 Feb 2019 18:35:59 +0200

Hi Mark,

The 38K iops for a single OSD is quite good. For the 4 OSDs, I think the 
55K iops may be starting to be affected by network latency on the server node.

It would be interesting to know, when using something more common like 3x 
replication, what additional amplification factor we see beyond the replica count.

Maged

On 28/02/2019 01:22, Mark Nelson wrote:
FWIW, I've got recent tests of a fairly recent master build 
(14.0.1-3118-gd239c2a) showing a single OSD hitting ~33-38K 4k 
randwrite IOPS with 3 client nodes running fio (io_depth = 32) both 
with RBD and with CephFS.  The OSD node had older gen CPUs (Xeon 
E5-2650 v3) and NVMe drives (Intel P3700).  The OSD process and 
threads were pinned to run on the first socket.  It took between 5 and 7 
cores to reach that throughput, though.

Jumping up to 4 OSDs in the node (no replication) improved aggregate 
throughput to ~54-55K IOPS with ~15 cores used, so 13-14K IOPS per OSD 
with around 3.5-4 cores each on average.  I.e., with more OSDs running on 
the same socket competing for cores, both the throughput per OSD and 
the IOPS/core rate went down.  With NVMe, you are likely best 
off when multiple OSD processes aren't competing with each other for 
cores and can mostly just run on a specific set of cores without 
contention.  I'd expect that NUMA-pinning each OSD process to specific 
cores, with enough cores to satisfy the OSD, might help.  (Nick Fisk 
also showed a while back that preventing the CPU from dropping into 
low-power C/P states can help dramatically as well.)
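
The per-OSD and per-core figures above are simple division. A minimal sketch in Python, using only the approximate numbers quoted in this message; reducing each range to its midpoint is an assumption made for illustration, not something measured:

```python
# Rough per-OSD and per-core rates from the figures quoted above.
# Ranges are reduced to midpoints purely for illustration (an assumption).

def midpoint(lo, hi):
    return (lo + hi) / 2

def rates(total_iops, osds, cores):
    """Return (IOPS per OSD, cores per OSD, IOPS per core)."""
    return total_iops / osds, cores / osds, total_iops / cores

# 1 OSD: ~33-38K IOPS using 5-7 cores.
single = rates(midpoint(33_000, 38_000), osds=1, cores=midpoint(5, 7))
# 4 OSDs: ~54-55K IOPS aggregate using ~15 cores.
four = rates(midpoint(54_000, 55_000), osds=4, cores=15)

for label, (per_osd, cores_per_osd, per_core) in (("1 OSD", single), ("4 OSDs", four)):
    print(f"{label}: {per_osd:,.0f} IOPS/OSD, "
          f"{cores_per_osd:.2f} cores/OSD, {per_core:,.0f} IOPS/core")
```

With these midpoints the IOPS/core rate drops from roughly 5.9K to roughly 3.6K when going from one OSD to four on the same socket.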

Mark

On 2/27/19 4:30 PM, Vitaliy Filippov wrote:
By "maximum write iops of an osd" I mean total iops divided by the 
number of OSDs. For example, an expensive setup from Micron 
(https://www.micron.com/about/blog/2018/april/micron-9200-max-red-hat-ceph-storage-30-reference-architecture-block-performance) 
achieves only 8750 peak write iops per NVMe. The exact NVMe drives they 
used are rated for 260000+ iops when connected directly :). CPU is a 
real bottleneck. The need for a Seastar-based rewrite is not a joke! :)
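
To put those two numbers side by side, here is a rough Python sketch. It uses only the figures quoted above, treats the "260000+" rating as exactly 260000, and deliberately ignores replication and any write amplification inside the cluster, so the result is an order-of-magnitude indication only:

```python
# Per-drive write IOPS inside the cluster vs. the drive's rated figure.
# Both numbers come from the paragraph above; the rating is a lower bound.
cluster_iops_per_nvme = 8750
rated_iops_per_nvme = 260_000

fraction = cluster_iops_per_nvme / rated_iops_per_nvme
print(f"Client-visible write IOPS per drive: {fraction:.1%} of the rated figure")
```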

Total iops is the number coming from a test like:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 \
    -rw=randwrite -pool=<your_pool> -runtime=60 -rbdname=testimg

...or from several such jobs run in parallel each over a separate RBD 
image.
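
For the parallel variant, a small Python sketch can build one fio command per RBD image. The pool name `testpool` and the image names `testimg0`..`testimg3` are hypothetical placeholders (the images must already exist), and using the image name as the fio job name is just a convenience chosen here:

```python
import shlex

def fio_cmd(pool, image, iodepth=128, runtime=60):
    """Build the fio argument list from the command above for one RBD image."""
    return [
        "fio", "-ioengine=rbd", "-direct=1", f"-name={image}", "-bs=4k",
        f"-iodepth={iodepth}", "-rw=randwrite", f"-pool={pool}",
        f"-runtime={runtime}", f"-rbdname={image}",
    ]

# Hypothetical pool and image names, one job per image.
commands = [fio_cmd("testpool", f"testimg{i}") for i in range(4)]

for cmd in commands:
    # Print shell-ready command lines; nothing is executed here.
    print(shlex.join(cmd))
```

To actually run them in parallel, each list could be passed to `subprocess.Popen` and all processes waited on; total iops is then the sum of the per-job results.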

This is a "random write bandwidth" test, and in fact it's not the 
most useful one: single-thread latency usually matters more 
than total bandwidth. To test for it, run:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 \
    -rw=randwrite -pool=<your_pool> -runtime=60 -rbdname=testimg

You'll get a pretty low number (< 100 for HDD clusters, 500-1000 for 
SSD clusters). A low number is expected. Anything above 1000 
iops (< 1ms latency, since single-thread iops = 1 / avg latency) is hard to 
achieve with Ceph no matter what disks you're using. Also, 
single-thread latency does not depend on the number of OSDs in the 
cluster, because the workload is not parallel.
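
The relation single-thread iops = 1 / avg latency can be checked with a few lines of Python. The three sample values are the round numbers mentioned above; the function names are made up for this sketch:

```python
# With iodepth=1 only one write is in flight at a time, so
# iops = 1 / average latency (in seconds), and vice versa.

def latency_ms_from_iops(iops):
    return 1000.0 / iops

def iops_from_latency_ms(latency_ms):
    return 1000.0 / latency_ms

for label, iops in (("HDD cluster", 100),
                    ("SSD cluster, low end", 500),
                    ("SSD cluster, high end", 1000)):
    print(f"{label}: {iops} iops <-> {latency_ms_from_iops(iops):.1f} ms per write")
```

So getting past 1000 single-thread iops means the whole write path has to complete in under 1 ms on average.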

However, you can also test the iops of single OSDs by creating a pool with 
size=1 and using a custom benchmark tool we've made with our 
colleagues from a Russian Ceph chat... we can publish it here 
shortly if you want :).

At some point I would expect the CPU to be the bottleneck. People here
have always said that for better latency you should get fast CPUs.
It would be nice to know what GHz you are testing at, and how that scales.
Rep 1-3, erasure probably also takes a hit.
How do you test the maximum iops of an OSD? (Just curious, so I can test
mine.)

A while ago I posted a CephFS test here, on SSD with rep 1, that was
performing nowhere near native, asking if this was normal, but I never
got a response to it. I remember that they sent everyone a questionnaire
and asked if they should focus more on performance; now I wish I had
checked that box ;)

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Maged Mokhtar
CEO PetaSAN
4 Emad El Deen Kamel
Cairo 11371, Egypt
www.petasan.org
+201006979931
skype: maged.mokhtar
