Hi,
I have retested with 4K blocks - results are below.
I am currently using 4 OSDs per Optane 900P drive. This was based on some posts I found on Proxmox Forums, and what seems to be "tribal knowledge" there.
I also saw this presentation, which mentions on page 14:
2-4 OSDs/NVMe SSD and 4-6 NVMe SSDs per node are sweet spots
Has anybody done much testing with pure Optane drives for Ceph? (Paper above seems to use them mixed with traditional SSDs).
Would increasing the number of OSDs help in this scenario? I am happy to try that - I assume I will need to blow away all the existing OSDs/Ceph setup and start again, of course.
Here are the rados bench results with 4K - the write IOPS are still a tad short of 15,000 - is that what I should be aiming for?
Write result:
# rados bench -p proxmox_vms 60 write -b 4K -t 16 --no-cleanup
Total time run:         60.001016
Total writes made:      726749
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     47.3136
Stddev Bandwidth:       2.16408
Max bandwidth (MB/sec): 48.7344
Min bandwidth (MB/sec): 38.5078
Average IOPS:           12112
Stddev IOPS:            554
Max IOPS:               12476
Min IOPS:               9858
Average Latency(s):     0.00132019
Stddev Latency(s):      0.000670617
Max latency(s):         0.065541
Min latency(s):         0.000689406
Sequential read result:
# rados bench -p proxmox_vms 60 seq -t 16
Total time run:       17.098593
Total reads made:     726749
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   166.029
Average IOPS:         42503
Stddev IOPS:          218
Max IOPS:             42978
Min IOPS:             42192
Average Latency(s):   0.000369021
Max latency(s):       0.00543175
Min latency(s):       0.000170024
Random read result:
# rados bench -p proxmox_vms 60 rand -t 16
Total time run:       60.000282
Total reads made:     2708799
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   176.353
Average IOPS:         45146
Stddev IOPS:          310
Max IOPS:             45754
Min IOPS:             44506
Average Latency(s):   0.000347637
Max latency(s):       0.00457886
Min latency(s):       0.000138381
I am happy to try fio with -ioengine=rbd (I used rados bench because that is what the Proxmox Ceph benchmark paper used). However, is there a common community-suggested starting command line that makes results easy to compare? fio seems quite complex in terms of options.
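For what it's worth, unless somebody suggests a better baseline, I was planning to start with something roughly like the following. The image name fio-test is just a placeholder I would create first, and the sizes and queue depth are only my guesses at sane starting values:

# rbd create --size 10G proxmox_vms/fio-test
# fio --ioengine=rbd --pool=proxmox_vms --rbdname=fio-test --clientname=admin \
      --rw=randwrite --bs=4k --iodepth=16 --numjobs=1 \
      --runtime=60 --time_based --direct=1 --name=rbd-4k-randwrite

(I assume swapping --rw=randwrite for --rw=randread would give the read side of the comparison.)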
Thanks,
Victor
On Sun, Mar 10, 2019 at 6:15 AM Vitaliy Filippov <vitalif@xxxxxxxxxx> wrote:
Welcome to our "slow ceph" party :)))
However I have to note that:
1) 500000 iops is for 4 KB blocks. You're testing it with 4 MB ones.
That's kind of an unfair comparison.
2) fio -ioengine=rbd is better than rados bench for testing.
3) You can't "compensate" for Ceph's overhead even by having infinitely
fast disks.
At its simplest, imagine that disk I/O takes X microseconds and Ceph's
overhead is Y for a single operation.
Suppose there is no parallelism. Then raw disk IOPS = 1000000/X and Ceph
IOPS = 1000000/(X+Y). Y is currently quite long, something around 400-800
microseconds or so. So even if X were zero (an infinitely fast disk), the best
IOPS number you can squeeze out of a single client thread (a DBMS, for
example) is 1000000/400 = only ~2500 iops.
Parallel iops are of course better, but still you won't get anything close
to 500000 iops from a single OSD. The expected number is around 15000.
Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you
want better results.
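For example, something like this should split one NVMe into several OSDs (the
device path is just an example, and on Proxmox the pveceph tooling may wrap
this step differently):

# ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1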
--
With best regards,
Vitaliy Filippov