We're evaluating persistent block providers for Kubernetes and looking at Ceph at the moment.
We aren't seeing performance anywhere near what we expect.
I have a 50-node proof-of-concept cluster, 40 nodes of which are available for storage and configured with Rook/Ceph. Each storage node has 10 GbE NICs and 8 x 1 TB SSDs (only 3 drives on each node have been allocated to Ceph).
We are testing with replicated pools of size 1 and 3. I've been running fio tests in parallel (a pod set up to run fio for each stream), and aggregate bandwidth averages around 150 MB/s.
I'm running the fio tests as follows:
direct=1, fsync=8|16|32|64, readwrite=write, blocksize=4k, numjobs=4|8, size=10G
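For example, one of the parallel streams ends up looking roughly like this inside the pod (the job name and filename are placeholders for the RBD-backed volume mount; --ioengine and --group_reporting are my additions for clarity, not part of the original parameter list):

# job name and --filename are placeholders for the mounted RBD volume;
# --ioengine/--group_reporting are assumptions added for readability of the results
fio --name=rbdwrite --filename=/data/fio-testfile --direct=1 --rw=write --bs=4k --numjobs=8 --size=10G --fsync=32 --ioengine=libaio --group_reporting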
Regardless of whether I run 1 stream or 40, the aggregate bandwidth reported by "ceph -s" is ~150 MB/s (full output below).
I'm creating my pool with pg_num/pgp_num = 1024|2048|4096.
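For the 2048-PG case that's roughly the following (the pool name is just an example; the application-enable line is what I'd expect to need on Luminous and later, not something from my original notes):

# pool name "rbdbench" is an example; tried pg_num/pgp_num of 1024, 2048 and 4096
ceph osd pool create rbdbench 2048 2048 replicated
ceph osd pool set rbdbench size 3        # also tested with size 1
# application tag is an assumption for Luminous+; rook may set this itself for pools it creates
ceph osd pool application enable rbdbench rbd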
A baseline dd (a 100 GB file written with a 1 GB block size) on these SSDs shows them capable of 1.6 GB/s.
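That baseline was along these lines (the path is a placeholder for a filesystem on one of the local SSDs, and oflag=direct is my assumption to keep the page cache out of the measurement):

# path is a placeholder; oflag=direct is an assumption to bypass the page cache
dd if=/dev/zero of=/mnt/ssd1/ddtest bs=1G count=100 oflag=direct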
I can't seem to find any limitations or bottlenecks on the nodes or the network.
Anyone have any idea where else I can look?
I'm new to Ceph and it just seems like this setup should be pushing more I/O. I've dug through a lot of performance-tuning sites and have implemented most of the suggestions.
# ceph -s
  cluster:
    id:     949a8caf-9a9b-4f09-8711-1d5158a65bd8
    health: HEALTH_OK

  services:
    mon: 7 daemons, quorum rook-ceph-mon1,rook-ceph-mon3,rook-ceph-mon0,rook-ceph-mon5,rook-ceph-mon4,rook-ceph-mon2,rook-ceph-mon6
    mgr: rook-ceph-mgr0(active)
    osd: 123 osds: 123 up, 123 in

  data:
    pools:   1 pools, 2048 pgs
    objects: 134k objects, 508 GB
    usage:   1240 GB used, 110 TB / 112 TB avail
    pgs:     2048 active+clean

  io:
    client: 138 MB/s wr, 0 op/s rd, 71163 op/s wr
Thanks for any help,
CC