Hi folks,
we are running a 3-node Proxmox cluster with - of course - Ceph :)
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)
10G network. iperf reports almost 10G between all nodes.
We are using mixed standard SSDs (Crucial / Samsung). We are aware that these disks cannot deliver high IOPS or great throughput, but we have several of these clusters and this one is showing very poor performance.
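One thing I still want to rule out is a single bad disk. My plan is to measure the raw sync-write latency of every SSD individually with fio, roughly like this (/dev/sdX is just a placeholder, and of course this would only be run against an unused disk or partition, since it writes to the device):

# fio --name=sync-write --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

Consumer SSDs without power-loss protection often perform very poorly on this kind of O_DSYNC workload, which directly limits Ceph write latency.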
NOW the strange fact:
While one specific node is rebooting (i.e. only two of the three nodes are up), the throughput is acceptable.
But as soon as that node is back online, throughput drops to roughly a third (from about 161 MB/s to 58 MB/s in the rados bench runs below).
2 NODES UP (third node rebooting):
# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_1767693
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        55        39   155.992       156    0.0445665    0.257988
    2      16       110        94    187.98       220     0.087097    0.291173
    3      16       156       140   186.645       184     0.462171    0.286895
    4      16       184       168    167.98       112    0.0235336    0.358085
    5      16       210       194   155.181       104     0.112401    0.347883
    6      16       252       236   157.314       168     0.134099    0.382159
    7      16       287       271   154.838       140    0.0264864     0.40092
    8      16       329       313   156.481       168    0.0609964    0.394753
    9      16       364       348   154.649       140     0.244309    0.392331
   10      16       416       400   159.981       208     0.277489    0.387424
Total time run: 10.335496
Total writes made: 417
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 161.386
Stddev Bandwidth: 37.8065
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 104
Average IOPS: 40
Stddev IOPS: 9
Max IOPS: 55
Min IOPS: 26
Average Latency(s): 0.396434
Stddev Latency(s): 0.428527
Max latency(s): 1.86968
Min latency(s): 0.020558
THIRD NODE ONLINE:
root@pve3:/# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_1771977
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        39        23   91.9943        92      0.21353    0.267249
    2      16        46        30   59.9924        28      0.29527    0.268672
    3      16        53        37   49.3271        28     0.122732    0.259731
    4      16        53        37   36.9954         0            -    0.259731
    5      16        53        37   29.5963         0            -    0.259731
    6      16        87        71   47.3271   45.3333     0.241921     1.19831
    7      16       106        90   51.4214        76     0.124821     1.07941
    8      16       129       113    56.492        92    0.0314146    0.941378
    9      16       142       126   55.9919        52     0.285536    0.871445
   10      16       147       131   52.3925        20     0.354803    0.852074
Total time run: 10.138312
Total writes made: 148
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 58.3924
Stddev Bandwidth: 34.405
Max bandwidth (MB/sec): 92
Min bandwidth (MB/sec): 0
Average IOPS: 14
Stddev IOPS: 8
Max IOPS: 23
Min IOPS: 0
Average Latency(s): 1.08818
Stddev Latency(s): 1.55967
Max latency(s): 5.02514
Min latency(s): 0.0255947
Does this point to a single faulty node?
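To narrow that down, my next step would be to compare per-OSD latency and per-OSD write speed while all three nodes are up, something along these lines (osd.0 is just an example ID, I would repeat the bench for every OSD):

# ceph osd perf
# ceph tell osd.0 bench

If the OSDs on that one node consistently show much higher commit/apply latency or much lower bench throughput, that would point at its disks or controller.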
root@pve3:/# ceph status
  cluster:
    id:     138c857a-c4e6-4600-9320-9567011470d6
    health: HEALTH_WARN
            application not enabled on 1 pool(s) (that's just for benchmarking)

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3
    mgr: pve1(active), standbys: pve3, pve2
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 612 pgs
    objects: 758.52k objects, 2.89TiB
    usage:   8.62TiB used, 7.75TiB / 16.4TiB avail
    pgs:     611 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:  4.99MiB/s rd, 1.36MiB/s wr, 678op/s rd, 105op/s wr
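(I am aware the HEALTH_WARN is only the missing application tag on the benchmark pool; it could be cleared with something like

# ceph osd pool application enable scbench rbd

so I don't think it is related to the performance problem.)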
Thank you.
Stefan