Re: 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

Hi Ashley,

Right - so sitting at around 50% of the rated bandwidth is OK, I guess, but it's the drop in IOPS that concerned me (hence the subject line about 200 IOPS) *sad face*.

That, and the Optane drives weren't exactly cheap, and I was hoping they would compensate for the overhead of Ceph.

Each Optane drive is rated for 550,000 IOPS at random read and 500,000 IOPS at random write. Yet we're seeing around 0.04% of that in testing (200 IOPS). Is that sort of drop in IOPS normal for Ceph?

Each node can take up to 8 x 2.5" drives. If I loaded up, say, 4 cheap SSDs in each node (e.g. Intel S3700s) instead of one Optane drive per node, would 4 x 3 = 12 drives give better performance? (Would I still put 4 OSDs per physical drive?) Or is there some way to supplement the Optanes with SSDs? (Although I assume any SSD I get is going to be slower than an Optane drive.)
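
(For splitting a single device into multiple OSDs, my understanding is that ceph-volume's batch mode can do it - roughly like the following, assuming a Ceph release where --osds-per-device is available; the device path is illustrative.)

# Dry-run first to see what ceph-volume would create
ceph-volume lvm batch --report --osds-per-device 4 /dev/nvme0n1

# Then actually create 4 OSDs on the single NVMe device
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1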

Or are there tweaks I can make to the configuration, or to our layout, that could eke out more IOPS?

(This is going to be used for VM hosting, so IOPS is definitely a concern).
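
(One follow-up test I could run is a 4K-block benchmark, which should be closer to what the VMs will actually do - roughly along these lines; the image name is just a placeholder, and fio would need to be built with the rbd engine.)

# Repeat the write benchmark with 4 KB objects instead of 4 MB
rados bench -p benchmarking 60 write -b 4096 -t 16 --no-cleanup

# Or test through RBD with fio, using a throwaway image
rbd create benchmarking/fio-test --size 10G
fio --name=4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=benchmarking --rbdname=fio-test \
    --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --runtime=60 --time_based
rbd rm benchmarking/fio-test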

Thanks,
Victor

On Sat, Mar 9, 2019 at 9:27 PM Ashley Merrick <singapore@xxxxxxxxxxxxxx> wrote:
What kind of results are you expecting?

Looking at the specs, they are "up to" 2000 MB/s write and 2500 MB/s read, so you're at around 50-60% of that maximum "up to" speed, which I wouldn't say is too bad, given that Ceph / BlueStore has overhead, especially when using a single disk for the DB, WAL and data.

Remember that Ceph scales with the number of physical disks you have. As you only have 3 disks, every piece of I/O hits all 3 disks; if you had 6 disks, for example, and still did replication of 3, then only 50% of the I/O would hit each disk, so I'd expect performance to jump.

On Sat, Mar 9, 2019 at 5:08 PM Victor Hooi <victorhooi@xxxxxxxxx> wrote:
Hi,

I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage, based around Intel Optane 900P drives (which are meant to be the bee's knees), and I'm seeing pretty low IOPS/bandwidth.
  • 3 nodes, each running a Ceph monitor daemon and OSDs.
  • Node 1 has 48 GB of RAM and 10 cores (Intel 4114); Nodes 2 and 3 have 32 GB of RAM and 4 cores (Intel E3-1230V6).
  • Each node has an Intel Optane 900P (480 GB) NVMe drive dedicated to Ceph.
  • 4 OSDs per node (total of 12 OSDs)
  • NICs are Intel X520-DA2, with 10GBASE-LR going to a Unifi US-XG-16.
  • The first 10Gb port carries Proxmox VM traffic; the second 10Gb port carries Ceph traffic.
I created a new Ceph pool specifically for benchmarking, with 128 PGs.
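
(From the CLI that is roughly equivalent to the following - a sketch, since the pool may just as well be created through the Proxmox GUI, and the size setting assumes replication of 3.)

ceph osd pool create benchmarking 128 128
ceph osd pool set benchmarking size 3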

Write results:
root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16 --no-cleanup
....
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     12258     12242   816.055       788   0.0856726   0.0783458
Total time run:         60.069008
Total writes made:      12258
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     816.261
Stddev Bandwidth:       17.4584
Max bandwidth (MB/sec): 856
Min bandwidth (MB/sec): 780
Average IOPS:           204
Stddev IOPS:            4
Max IOPS:               214
Min IOPS:               195
Average Latency(s):     0.0783801
Stddev Latency(s):      0.0468404
Max latency(s):         0.437235
Min latency(s):         0.0177178

Sequential read results - I don't know why this only ran for 32 seconds?

root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
....
Total time run:       32.608549
Total reads made:     12258
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1503.65
Average IOPS:         375
Stddev IOPS:          22
Max IOPS:             410
Min IOPS:             326
Average Latency(s):   0.0412777
Max latency(s):       0.498116
Min latency(s):       0.00447062

Random read result:

root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
....
Total time run:       60.066384
Total reads made:     22819
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1519.59
Average IOPS:         379
Stddev IOPS:          21
Max IOPS:             424
Min IOPS:             320
Average Latency(s):   0.0408697
Max latency(s):       0.662955
Min latency(s):       0.00172077

I then cleaned up with:

root@vwnode1:~# rados -p benchmarking cleanup
Removed 12258 objects

I then tested with another Ceph pool, with 512 PGs (originally created for Proxmox VMs) - results seem quite similar:

root@vwnode1:~# rados bench -p proxmox_vms 60 write -b 4M -t 16 --no-cleanup
....
Total time run:         60.041712
Total writes made:      12132
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     808.238
Stddev Bandwidth:       20.7444
Max bandwidth (MB/sec): 860
Min bandwidth (MB/sec): 744
Average IOPS:           202
Stddev IOPS:            5
Max IOPS:               215
Min IOPS:               186
Average Latency(s):     0.0791746
Stddev Latency(s):      0.0432707
Max latency(s):         0.42535
Min latency(s):         0.0200791

Sequential read result - once again, only ran for 32 seconds:

root@vwnode1:~# rados bench -p proxmox_vms 60 seq -t 16
....
Total time run:       31.249274
Total reads made:     12132
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1552.93
Average IOPS:         388
Stddev IOPS:          30
Max IOPS:             460
Min IOPS:             320
Average Latency(s):   0.0398702
Max latency(s):       0.481106
Min latency(s):       0.00461585

Random read result:

root@vwnode1:~# rados bench -p proxmox_vms 60 rand -t 16
...
Total time run:       60.088822
Total reads made:     23626
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1572.74
Average IOPS:         393
Stddev IOPS:          25
Max IOPS:             432
Min IOPS:             322
Average Latency(s):   0.0392854
Max latency(s):       0.693123
Min latency(s):       0.00178545

Cleanup:

root@vwnode1:~# rados -p proxmox_vms cleanup
Removed 12132 objects
root@vwnode1:~# rados df
POOL_NAME   USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD     WR_OPS WR
proxmox_vms 169GiB   43396      0 130188                  0       0        0 909519 298GiB 619697 272GiB

total_objects    43396
total_used       564GiB
total_avail      768GiB
total_space      1.30TiB

These results (800 MB/s writes, 1500 MB/s reads, 200 write IOPS, and 400 read IOPS) seem incredibly low - particularly considering what the Optane 900P is meant to be capable of.

Is this in line with what you might expect on this hardware with Ceph though?

Or is there some way to find the source of the bottleneck?
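
(Would something like the following help narrow it down - per-OSD latency via ceph osd perf, plus the built-in single-OSD write benchmark? The OSD id and byte counts are illustrative.)

# Commit/apply latency per OSD, as seen by the cluster
ceph osd perf

# Built-in write benchmark against one OSD (total bytes, then block size):
# here 1 GiB written in 4 MiB chunks to osd.0
ceph tell osd.0 bench 1073741824 4194304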

Thanks,
Victor
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
