Ceph I/O issues on an all-SSD cluster

OK, where to start. I have been debugging intensively for the last two days, but I can't wrap my head around the performance issues we see in one of our two hyperconverged Ceph/Proxmox clusters.

Let me introduce our two clusters and some of the debugging results.

*1. Cluster for internal purposes (performs as expected)*

3 x Supermicro servers with identical specs:
CPU: 1 x i7-7700K @ 4.20GHz (1 Socket) 4 cores / 8 threads
RAM: 64 GB RAM
OSDs: 4 per node. 1 per SSD (Intel S4610) (12 OSDs in all)
1 x 10GbE RJ45 nic. MTU 9000 No bonding

_A total of 3 servers with 12 OSDs_

Network:
1 x Unifi Switch 16 XG


*2. Cluster for customer VPSs (performs much worse than the internal cluster)*
3 x Dell R630 with the following specs:
CPU: 2 x E5-2697 v3 @ 2.60GHz (2 Sockets) 28 cores / 56 threads
RAM 256GB
OSDs: 10 per node. 1 per SSD (Intel S4610)
1 x 10GbE SFP+ nic with 2 ports bonded via LACP (bond-xmit-hash-policy layer3+4). MTU 9000

2 x Supermicro X11SRM-VF with the following specs:
CPU: 1 x W-2145 @ 3.70GHz (1 Socket) 8 cores / 16 threads
RAM: 256 GB
OSDs: 8 per node. 1 per SSD (Intel S4610)
1 x 10GbE SFP+ nic with 2 ports bonded via LACP (bond-xmit-hash-policy layer3+4). MTU 9000

1 x Dell R630 with the following specs:
CPU: 2 x E5-2696 v4 @ 2.20GHz (2 Sockets) 44 cores / 88 threads
RAM: 256 GB
OSDs: 8 per node. 1 per SSD (Intel S4610)
1 x 10GbE SFP+ nic with 2 ports bonded via LACP (bond-xmit-hash-policy layer3+4). MTU 9000

_A total of 6 servers with 54 OSDs_


Network:
2 x Dell N4032F 10GbE SFP+ Switch connected with MLAG. Each node is connected to each switch.
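
(Both clusters run jumbo frames, and the 2. cluster adds LACP bonding on top. A quick sanity check of the MTU path and the bond settings on each node looks something like this; the bond name and peer IP are placeholders:)

# confirm 9000-byte frames pass end-to-end (8972 = 9000 minus 28 bytes of IP/ICMP headers)
ping -M do -s 8972 -c 3 10.10.10.2

# confirm bond mode, LACP partner state and xmit hash policy
cat /proc/net/bonding/bond0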

To get a fair comparison, I ran the following fio test on one host in each cluster, against an RBD block device that I created:
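
(Setup was roughly the following in both cases; the image size, pool name and mount point are placeholders. The fio file sits on a filesystem on top of the mapped RBD device, which is why rbd0 shows up in the disk stats below:)

rbd create rbd/fio-test --size 20G
rbd map rbd/fio-test              # appears as /dev/rbd0
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt/fio-test
cd /mnt/fio-test                  # fio below writes to the file "test" here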

*1. cluster:*
fio --randrepeat=1 --ioengine=libaio --sync=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=25.0MiB/s][w=6409 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=869177: Wed Nov 13 11:15:59 2019
write: IOPS=4158, BW=16.2MiB/s (17.0MB/s)(4096MiB/252126msec); 0 zone resets
bw ( KiB/s): min= 2075, max=32968, per=99.96%, avg=16627.60, stdev=9635.42, samples=504
iops : min= 518, max= 8242, avg=4156.88, stdev=2408.86, samples=504
cpu : usr=0.53%, sys=3.81%, ctx=109599, majf=0, minf=7
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=16.2MiB/s (17.0MB/s), 16.2MiB/s-16.2MiB/s (17.0MB/s-17.0MB/s), io=4096MiB (4295MB), run=252126-252126msec

Disk stats (read/write):
rbd0: ios=46/1221898, merge=0/1870438, ticks=25/4654920, in_queue=1980016, util=84.70%

*2. cluster:*
fio --randrepeat=1 --ioengine=libaio --sync=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][99.9%][w=7024KiB/s][w=1756 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=794096: Wed Nov 13 11:25:56 2019
write: IOPS=1353, BW=5415KiB/s (5545kB/s)(4096MiB/774601msec); 0 zone resets
bw ( KiB/s): min= 40, max=30600, per=100.00%, avg=5420.24, stdev=3710.17, samples=1547
iops : min= 10, max= 7650, avg=1355.06, stdev=927.54, samples=1547
cpu : usr=0.16%, sys=1.19%, ctx=100028, majf=0, minf=8
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=5415KiB/s (5545kB/s), 5415KiB/s-5415KiB/s (5545kB/s-5545kB/s), io=4096MiB (4295MB), run=774601-774601msec

Disk stats (read/write):
rbd0: ios=0/1222639, merge=0/1784089, ticks=0/12124812, in_queue=9514280, util=45.14%

And identical rados bench tests:

1. cluster: https://i.imgur.com/AdARCA6.png
2. cluster: https://i.imgur.com/Di7mYQh.png
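
(The bench runs were plain rados bench against a test pool, roughly like this; the pool name is a placeholder:)

rados bench -p bench-test 60 write --no-cleanup
rados bench -p bench-test 60 seq
rados bench -p bench-test 60 rand
rados -p bench-test cleanup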

I have fio tested every disk individually and I have tested the network, but I can't find the reason why the 2. cluster performs so much worse than the 1. cluster.
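
(The per-disk and node-to-node tests were along these lines; device paths and IPs are placeholders, and the raw-disk fio run is destructive:)

# raw 4k sync write test directly against one SSD, no Ceph involved
fio --name=raw-ssd --filename=/dev/sdX --ioengine=libaio --direct=1 --sync=1 --bs=4k --iodepth=1 --numjobs=1 --rw=randwrite --runtime=60 --time_based

# throughput between two nodes
iperf3 -s                   # on the first node
iperf3 -c 10.10.10.2 -P 4   # on the second node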

--

Dennis Højgaard
Powerhosting Support

*t:*  +45 7222 4457 |*e:*  dh@xxxxxxxxxxxxxxx |*w:*  https://powerhosting.dk

