Welcome to our "slow ceph" party :)))
However, I have to note that:
1) 500000 iops is quoted for 4 KB blocks, while you're testing with 4 MB
ones. That's kind of an unfair comparison.
2) fio -ioengine=rbd is a better benchmarking tool than rados bench; see
the example command right after this list.
3) You can't "compensate" for Ceph's overhead even by having infinitely
fast disks.
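
Regarding (2): here's a minimal fio command for measuring single-threaded
latency (a sketch; the pool, image, and client names are placeholders for
your own setup):

    fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
        --rw=randwrite --bs=4k --iodepth=1 --runtime=60 --time_based \
        --name=rbdtest

iodepth=1 is the interesting case here, because it shows the per-operation
latency rather than how far you can hide it with parallelism.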
To expand on (3): at its simplest, imagine that disk I/O takes X
microseconds and Ceph's software overhead adds another Y microseconds per
operation. Suppose there is no parallelism. Then raw disk IOPS = 1000000/X
and Ceph IOPS = 1000000/(X+Y). Y is currently quite large, somewhere
around 400-800 microseconds. So even with an infinitely fast disk (X
approaching zero), the best IOPS number you can squeeze out of a single
client thread (a DBMS, for example) is 1000000/400 = only ~2500 iops.
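
To make that arithmetic concrete, here's a trivial sketch (the latency
figures are illustrative assumptions, not measurements):

    # Toy serial-latency model: one outstanding I/O at a time (iodepth=1).
    # X = raw device latency, Y = Ceph software overhead, in microseconds.
    def iops(latency_us):
        return 1_000_000 / latency_us

    X, Y = 100, 400          # assumed NVMe write latency, optimistic Ceph overhead

    print(iops(X))           # raw disk alone: 10000 iops
    print(iops(X + Y))       # same disk through Ceph: 2000 iops
    print(iops(Y))           # infinitely fast disk (X -> 0): still only 2500 iops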
Parallel iops are of course better, but you still won't get anywhere close
to 500000 iops from a single OSD; the expected number is around 15000. If
you want better results, create multiple OSDs on a single NVMe (see the
sketch below) and accept the extra CPU usage.
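
If you do split one NVMe into several OSDs, recent ceph-volume releases
can do it in one step (a sketch: the device path is an example, and check
that your release supports the flag):

    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1

Keep in mind each extra OSD on the drive multiplies the CPU cost
accordingly.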
--
With best regards,
Vitaliy Filippov