I've gotten about 3.2 GB/s with IPoIB on QDR, but it took a couple of weeks
of tuning to get that rate. If your switch is at 2048 MTU, it is really hard
to get it increased without an outage, if I remember correctly. Connected
mode makes it much easier to get higher MTUs, but it was a bit flaky with
IPoIB (sometimes I had to send several pings to get the connection
established). This was all a couple of years ago now, so my memory is a bit
fuzzy. My current IB Ceph cluster is so small that tuning is not going to
help, because the bottleneck is my disks and CPU.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, Nov 24, 2015 at 10:26 AM, German Anders wrote:
> Thanks a lot Robert for the explanation. I understand what you are saying,
> and I'm also excited to see more about IB with Ceph to get those
> performance numbers up, and hopefully (hopefully soon) to see Accelio
> working in production. Regarding the HP IB switch, we have 4 ports
> (uplinks) connected to our IB switch, and internally the blades are
> connected through the backplane to two ports each, so they use the total
> number of ports on the enclosure switch (16 ports). The bonding I've
> configured is active/backup; I didn't know that active/active was possible
> with IPoIB. The adapters on the Ceph nodes (Supermicro servers) are
> Mellanox Technologies MT27500 Family [ConnectX-3]. I also double-checked
> the port type configuration on the IB switch: the speed rate is 14.0 Gbps
> (per lane), the supported MTU is 4096, and the current line rate is
> 56.0 Gbps.
>
> I've tried almost all possible combinations and I'm not getting anything
> better than 1.8 GB/s, so I was wondering if this is the top speed for this
> kind of setup.
>
> Best,
>
> German
>
> 2015-11-24 14:11 GMT-03:00 Robert LeBlanc:
>>
>> I've had wildly different iperf results based on the version of the
>> kernel, OFED, and whether you are using datagram or connected mode, as
>> well as the MTU. You really have to just try all the different options
>> to figure out what works best.
>>
>> Please also remember that you will not get iSER performance out of
>> Ceph at the moment (probably never), but the work being done will
>> help. Even if you get the network transport optimally tuned, unless
>> you have a massive Ceph cluster, you won't get the full performance
>> out of the SSDs. I'm just as excited about Ceph on InfiniBand, but
>> I've had to just chill out and let the devs do their work.
>>
>> I've never had good experiences with active/active bonding on IPoIB.
>> For two blades in the same chassis, you should get non-blocking line
>> rate. For going out of the chassis, you will be limited by the number
>> of ports you connect to the upstream switch (that is why there is
>> usually the same number of uplink ports as there are blades, so that
>> you can be non-blocking; however, HP has been selling switches with
>> only half the uplinks, making your oversubscription 2:1 -- it really
>> depends on what you actually need). Between QDR and FDR, you should
>> get QDR speed. Also be sure it is full FDR and not FDR-10, which is
>> the same signal rate as QDR but with the new 64/66 encoding; it won't
>> give you as much speed improvement as FDR, and it can be difficult to
>> tell which your adapter has if you don't research it. We thought we
>> bought FDR cards only to find out later they were FDR-10.
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
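As a reference for the datagram/connected discussion above: the IPoIB mode
can be checked and switched per interface at runtime. A minimal sketch,
assuming the interface is ib0 (this is a runtime change only; it also has to
go into your interface configuration to survive a reboot):

    # Show whether ib0 is in datagram or connected mode
    cat /sys/class/net/ib0/mode

    # Switch to connected mode and raise the MTU; connected mode allows
    # up to 65520, while datagram mode is capped by the fabric's IB MTU
    # (e.g. 2044 at a 2048 IB MTU, 4092 at 4096)
    echo connected > /sys/class/net/ib0/mode
    ip link set ib0 mtu 65520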
>>
>> On Tue, Nov 24, 2015 at 8:24 AM, German Anders wrote:
>> > Another test made between two HP blades with QDR (with bonding):
>> >
>> > e60-host01# iperf -s
>> > ------------------------------------------------------------
>> > Server listening on TCP port 5001
>> > TCP window size: 85.3 KByte (default)
>> > ------------------------------------------------------------
>> > [  5] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41807
>> > [  4] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41806
>> > [  6] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41808
>> > [  7] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41809
>> > [ ID] Interval       Transfer     Bandwidth
>> > [  5]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  6]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
>> > [  7]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
>> > [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec
>> >
>> > e60-host02# iperf -c 172.23.18.2 -P 4
>> > ------------------------------------------------------------
>> > Client connecting to 172.23.18.2, TCP port 5001
>> > TCP window size: 2.50 MByte (default)
>> > ------------------------------------------------------------
>> > [  3] local 172.23.18.1 port 41806 connected with 172.23.18.2 port 5001
>> > [  5] local 172.23.18.1 port 41808 connected with 172.23.18.2 port 5001
>> > [  4] local 172.23.18.1 port 41807 connected with 172.23.18.2 port 5001
>> > [  6] local 172.23.18.1 port 41809 connected with 172.23.18.2 port 5001
>> > [ ID] Interval       Transfer     Bandwidth
>> > [  3]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  5]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
>> > [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  6]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
>> > [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec
>> >
>> > Note that the blades are also in the same enclosure.
>> >
>> > Bonding configuration:
>> >
>> > alias bond-ib bonding
>> > options bonding mode=1 miimon=100 downdelay=100 updelay=100 max_bonds=2
>> >
>> > ## INFINIBAND CONF
>> >
>> > auto ib0
>> > iface ib0 inet manual
>> >     bond-master bond-ib
>> >
>> > auto ib1
>> > iface ib1 inet manual
>> >     bond-master bond-ib
>> >
>> > auto bond-ib
>> > iface bond-ib inet static
>> >     address 172.23.xx.xx
>> >     netmask 255.255.xx.xx
>> >     slaves ib0 ib1
>> >     bond_miimon 100
>> >     bond_mode active-backup
>> >     pre-up echo connected > /sys/class/net/ib0/mode
>> >     pre-up echo connected > /sys/class/net/ib1/mode
>> >     pre-up /sbin/ifconfig ib0 mtu 65520
>> >     pre-up /sbin/ifconfig ib1 mtu 65520
>> >     pre-up modprobe bond-ib
>> >     pre-up /sbin/ifconfig bond-ib mtu 65520
>> >
>> > German
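A side note on the configuration above: with mode=1 (active-backup), only
one slave passes traffic at a time, so the bond is capped at a single port's
rate. A quick sketch for confirming the mode and the currently active slave,
assuming the bond-ib name from the configuration:

    grep -E "Bonding Mode|Currently Active Slave" /proc/net/bonding/bond-ib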
>> >
>> > 2015-11-24 11:51 GMT-03:00 Mark Nelson:
>> >>
>> >> Each port should be able to do 40Gb/s or 56Gb/s minus overhead and any
>> >> PCIe or card-related bottlenecks. IPoIB will further limit that,
>> >> especially if you haven't done any kind of interrupt affinity tuning.
>> >>
>> >> Assuming these are Mellanox cards, you'll want to read this guide:
>> >>
>> >> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>> >>
>> >> For QDR, I think the maximum throughput with IPoIB I've ever seen was
>> >> about 2.7GB/s for a single port. Typically 2-2.5GB/s is about what you
>> >> should expect for a well-tuned setup.
>> >>
>> >> I'd still suggest doing iperf tests. It's really easy:
>> >>
>> >> "iperf -s" on one node to act as a server.
>> >>
>> >> "iperf -c <server address> -P <parallel streams>" on the client.
>> >>
>> >> This will give you an idea of how your network is doing. All-to-all
>> >> network tests are also useful, in that sometimes network issues crop
>> >> up only when there's lots of traffic across many ports. We've seen
>> >> this in lab environments, especially with bonded ethernet.
>> >>
>> >> Mark
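The all-to-all test Mark mentions is just the pairwise test run from every
node against every other node. A sketch, where the host list and stream
count are assumptions for this cluster:

    # On every node, start a daemonized iperf server first:
    #   iperf -s -D
    # Then, from each node in turn, sweep the other nodes:
    for host in 172.23.18.1 172.23.18.2; do
        iperf -c "$host" -P 4 -t 10
    done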
>> >>
>> >> On 11/24/2015 07:22 AM, German Anders wrote:
>> >>>
>> >>> After doing some more in-depth research and tuning some parameters,
>> >>> I've gained a little more performance:
>> >>>
>> >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
>> >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>> >>> --group_reporting --exitall
>> >>> --name dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> >>> --filename=/mnt/e60host01vol1/test1
>> >>>
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>> ...
>> >>> fio-2.1.3
>> >>> Starting 4 processes
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO
>> >>> file(s) (1 file(s) / 16384MB)
>> >>> Jobs: 4 (f=4): [rrrr] [60.5% done] [1714MB/0KB/0KB /s] [1713/0/0 iops]
>> >>> [eta 00m:15s]
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
>> >>> err= 0: pid=54857: Tue Nov 24 07:56:30 2015
>> >>>   read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062msec
>> >>>     slat (usec): min=131, max=63426, avg=2249.87, stdev=4320.91
>> >>>     clat (msec): min=2, max=321, avg=70.56, stdev=35.80
>> >>>      lat (msec): min=2, max=321, avg=72.81, stdev=36.13
>> >>>     clat percentiles (msec):
>> >>>      |  1.00th=[  13],  5.00th=[  24], 10.00th=[  30], 20.00th=[  40],
>> >>>      | 30.00th=[  50], 40.00th=[  57], 50.00th=[  65], 60.00th=[  75],
>> >>>      | 70.00th=[  85], 80.00th=[  98], 90.00th=[ 120], 95.00th=[ 139],
>> >>>      | 99.00th=[ 178], 99.50th=[ 194], 99.90th=[ 229], 99.95th=[ 247],
>> >>>      | 99.99th=[ 273]
>> >>>     bw (KB /s): min=301056, max=612352, per=25.01%, avg=449291.87,
>> >>>     stdev=54288.85
>> >>>     lat (msec) : 4=0.11%, 10=0.61%, 20=2.11%, 50=27.87%, 100=50.92%
>> >>>     lat (msec) : 250=18.34%, 500=0.03%
>> >>>   cpu          : usr=0.19%, sys=33.60%, ctx=66708, majf=0, minf=636
>> >>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, >=64=0.0%
>> >>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> >>>      issued    : total=r=38699/w=0/d=0, short=r=0/w=0/d=0
>> >>>
>> >>> Run status group 0 (all jobs):
>> >>>    READ: io=38699MB, aggrb=1754.2MB/s, minb=1754.2MB/s,
>> >>>    maxb=1754.2MB/s, mint=22062msec, maxt=22062msec
>> >>>
>> >>> Disk stats (read/write):
>> >>>   rbd1: ios=77386/17, merge=0/122, ticks=3168312/500,
>> >>>   in_queue=3170168, util=99.76%
>> >>>
>> >>> The thing is that this test was run from an HP blade enclosure with
>> >>> QDR. If the max throughput on QDR is around 3.2 GB/s, I guess that
>> >>> number must be divided by the total number of ports (in this case 2),
>> >>> so a maximum of 1.6 GB/s is the most throughput I'll get on a single
>> >>> port -- is that correct? I also made another test on a host with FDR
>> >>> (so max throughput would be around 6.8 GB/s); if the same theory
>> >>> holds, that would lead me to 3.4 GB/s per port, but I'm not getting
>> >>> more than 1.4 - 1.6 GB/s. Any ideas? Same tuning on both servers.
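For reference, the ceiling figures being discussed fall out of the link
encoding. A rough sketch of the arithmetic (theoretical payload rates;
IPoIB lands well below these in practice):

    # QDR:    4 lanes x 10 Gbit/s signalling, 8b/10b encoding
    #         40 Gbit/s x 8/10     = 32 Gbit/s    = ~4.0 GB/s
    # FDR-10: 4 lanes x 10.3125 Gbit/s, 64b/66b encoding
    #         41.25 Gbit/s x 64/66 = 40 Gbit/s    = ~5.0 GB/s
    # FDR:    4 lanes x 14.0625 Gbit/s, 64b/66b encoding
    #         56.25 Gbit/s x 64/66 = ~54.5 Gbit/s = ~6.8 GB/s
    # Note: with an active-backup bond only one port carries traffic at
    # a time, so the per-port ceiling is not divided across the ports.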
>> >>>
>> >>> Basically, I changed the cpufreq scaling_governor of all CPUs to
>> >>> 'performance' and then set the following values:
>> >>>
>> >>> sysctl -w net.ipv4.tcp_timestamps=0
>> >>> sysctl -w net.core.netdev_max_backlog=250000
>> >>> sysctl -w net.core.rmem_max=4194304
>> >>> sysctl -w net.core.wmem_max=4194304
>> >>> sysctl -w net.core.rmem_default=4194304
>> >>> sysctl -w net.core.wmem_default=4194304
>> >>> sysctl -w net.core.optmem_max=4194304
>> >>> sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
>> >>> sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
>> >>> sysctl -w net.ipv4.tcp_low_latency=1
>> >>>
>> >>> However, the HP blade doesn't have the same Intel CPUs as the other
>> >>> server, so the governor part of this tuning can't be done there; I
>> >>> left it at the defaults and only changed the TCP networking part.
>> >>>
>> >>> Any comments or hints would be really appreciated.
>> >>>
>> >>> Thanks in advance,
>> >>>
>> >>> Best,
>> >>>
>> >>> German
>> >>>
>> >>> 2015-11-23 15:06 GMT-03:00 Robert LeBlanc:
>> >>>
>> >>> Are you using unconnected mode or connected mode? With connected
>> >>> mode you can up your MTU to 64K, which may help on the network side.
>> >>> ----------------
>> >>> Robert LeBlanc
>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>> >>>
>> >>> On Mon, Nov 23, 2015 at 10:40 AM, German Anders wrote:
>> >>> > Hi Mark,
>> >>> >
>> >>> > Thanks a lot for the quick response. Regarding the numbers you sent
>> >>> > me, they look REALLY nice. I have the following setup:
>> >>> >
>> >>> > 4 OSD nodes:
>> >>> >
>> >>> > 2 x Intel Xeon E5-2650v2 @2.60GHz
>> >>> > 1 x Network controller: Mellanox Technologies MT27500 Family
>> >>> >     [ConnectX-3] Dual-Port (1 for PUB and 1 for CLUS)
>> >>> > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
>> >>> > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the
>> >>> >     same drive, so a 1:1 relationship)
>> >>> > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
>> >>> > 128GB RAM
>> >>> >
>> >>> > [0:0:0:0]  disk  ATA  INTEL SSDSC2BA20  0110  /dev/sdc
>> >>> > [0:0:1:0]  disk  ATA  INTEL SSDSC2BA20  0110  /dev/sdd
>> >>> > [0:0:2:0]  disk  ATA  INTEL SSDSC2BA20  0110  /dev/sde
>> >>> > [0:0:3:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdf
>> >>> > [0:0:4:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdg
>> >>> > [0:0:5:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdh
>> >>> > [0:0:6:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdi
>> >>> > [0:0:7:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdj
>> >>> > [0:0:8:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdk
>> >>> > [0:0:9:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdl
>> >>> > [0:0:10:0] disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdm
>> >>> >
>> >>> > sdf      8:80   0 745.2G 0 disk
>> >>> > |-sdf1   8:81   0 740.2G 0 part /var/lib/ceph/osd/ceph-16
>> >>> > `-sdf2   8:82   0     5G 0 part
>> >>> > sdg      8:96   0 745.2G 0 disk
>> >>> > |-sdg1   8:97   0 740.2G 0 part /var/lib/ceph/osd/ceph-17
>> >>> > `-sdg2   8:98   0     5G 0 part
>> >>> > sdh      8:112  0 745.2G 0 disk
>> >>> > |-sdh1   8:113  0 740.2G 0 part /var/lib/ceph/osd/ceph-18
>> >>> > `-sdh2   8:114  0     5G 0 part
>> >>> > sdi      8:128  0 745.2G 0 disk
>> >>> > |-sdi1   8:129  0 740.2G 0 part /var/lib/ceph/osd/ceph-19
>> >>> > `-sdi2   8:130  0     5G 0 part
>> >>> > sdj      8:144  0 745.2G 0 disk
>> >>> > |-sdj1   8:145  0 740.2G 0 part /var/lib/ceph/osd/ceph-20
>> >>> > `-sdj2   8:146  0     5G 0 part
>> >>> > sdk      8:160  0 745.2G 0 disk
>> >>> > |-sdk1   8:161  0 740.2G 0 part /var/lib/ceph/osd/ceph-21
>> >>> > `-sdk2   8:162  0     5G 0 part
>> >>> > sdl      8:176  0 745.2G 0 disk
>> >>> > |-sdl1   8:177  0 740.2G 0 part /var/lib/ceph/osd/ceph-22
>> >>> > `-sdl2   8:178  0     5G 0 part
>> >>> > sdm      8:192  0 745.2G 0 disk
>> >>> > |-sdm1   8:193  0 740.2G 0 part /var/lib/ceph/osd/ceph-23
>> >>> > `-sdm2   8:194  0     5G 0 part
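A note on the sysctl and governor changes German describes earlier in the
thread: sysctl -w and writes to scaling_governor are runtime-only. One way
to persist and reapply them, a sketch assuming a Debian-style /etc/sysctl.d
layout (the file name below is hypothetical):

    # Put the net.* settings (without the "sysctl -w" prefix) into
    # /etc/sysctl.d/90-ipoib-tuning.conf, then load them:
    sysctl -p /etc/sysctl.d/90-ipoib-tuning.conf

    # Pin every CPU to the performance governor
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done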
>> >>> >
>> >>> > $ rados bench -p rbd 20 write --no-cleanup -t 4
>> >>> > Maintaining 4 concurrent writes of 4194304 bytes for up to 20
>> >>> > seconds or 0 objects
>> >>> > Object prefix: benchmark_data_cibm01_1409
>> >>> >  sec Cur ops started finished avg MB/s cur MB/s  last lat   avg lat
>> >>> >    0       0       0        0        0        0         -         0
>> >>> >    1       4     121      117  467.894      468 0.0337203 0.0336809
>> >>> >    2       4     244      240  479.895      492 0.0304306 0.0330524
>> >>> >    3       4     372      368  490.559      512 0.0361914 0.0323822
>> >>> >    4       4     491      487  486.899      476 0.0346544 0.0327169
>> >>> >    5       4     587      583  466.302      384  0.110718 0.0342427
>> >>> >    6       4     701      697  464.575      456 0.0324953 0.0343136
>> >>> >    7       4     811      807  461.053      440 0.0400344 0.0345994
>> >>> >    8       4     923      919  459.412      448 0.0255677 0.0345767
>> >>> >    9       4    1032     1028  456.803      436 0.0309743 0.0349256
>> >>> >   10       4    1119     1115  445.917      348  0.229508 0.0357856
>> >>> >   11       4    1222     1218  442.826      412 0.0277902 0.0360635
>> >>> >   12       4    1315     1311  436.919      372 0.0303377 0.0365673
>> >>> >   13       4    1424     1420  436.842      436 0.0288001   0.03659
>> >>> >   14       4    1524     1520  434.206      400 0.0360993 0.0367697
>> >>> >   15       4    1632     1628  434.054      432 0.0296406 0.0366877
>> >>> >   16       4    1740     1736  433.921      432 0.0310995 0.0367746
>> >>> >   17       4    1836     1832   430.98      384 0.0250518 0.0370169
>> >>> >   18       4    1941     1937  430.366      420  0.027502 0.0371341
>> >>> >   19       4    2049     2045  430.448      432 0.0260257 0.0370807
>> >>> > 2015-11-23 12:10:58.587087 min lat: 0.0229266 max lat: 0.27063
>> >>> > avg lat: 0.0373936
>> >>> >   20       4    2141     2137  427.322      368 0.0351276 0.0373936
>> >>> > Total time run:         20.186437
>> >>> > Total writes made:      2141
>> >>> > Write size:             4194304
>> >>> > Bandwidth (MB/sec):     424.245
>> >>> > Stddev Bandwidth:       102.136
>> >>> > Max bandwidth (MB/sec): 512
>> >>> > Min bandwidth (MB/sec): 0
>> >>> > Average Latency:        0.0376536
>> >>> > Stddev Latency:         0.032886
>> >>> > Max latency:            0.27063
>> >>> > Min latency:            0.0229266
>> >>> >
>> >>> > $ rados bench -p rbd 20 seq --no-cleanup -t 4
>> >>> >  sec Cur ops started finished avg MB/s cur MB/s   last lat   avg lat
>> >>> >    0       0       0        0        0        0          -         0
>> >>> >    1       4     394      390  1559.52     1560  0.0148888 0.0102236
>> >>> >    2       4     753      749  1496.68     1436  0.0129162 0.0106595
>> >>> >    3       4    1137     1133  1509.65     1536  0.0101854 0.0105731
>> >>> >    4       4    1526     1522  1521.17     1556  0.0122154 0.0103827
>> >>> >    5       4    1890     1886  1508.07     1456 0.00825445 0.0105908
>> >>> > Total time run:        5.675418
>> >>> > Total reads made:      2141
>> >>> > Read size:             4194304
>> >>> > Bandwidth (MB/sec):    1508.964
>> >>> > Average Latency:       0.0105951
>> >>> > Max latency:           0.211469
>> >>> > Min latency:           0.00603694
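Worth noting on the runs above: -t 4 limits rados bench to four in-flight
4MB objects, so the result is as much a latency measurement as a bandwidth
one. A sketch of the same test with more concurrency, to see whether the
ceiling moves:

    # Higher queue depth; --no-cleanup leaves the objects in place so
    # the seq pass has data to read back
    rados bench -p rbd 20 write -t 32 --no-cleanup
    rados bench -p rbd 20 seq -t 32

    # Remove the benchmark objects when finished
    rados -p rbd cleanup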
>> >>> >
>> >>> > I'm not even close to the numbers that you are getting... :( Any
>> >>> > ideas or hints? I've also configured NOOP as the I/O scheduler for
>> >>> > all the SSD disks. I really don't know what else to look at in
>> >>> > order to improve performance and get numbers similar to yours.
>> >>> >
>> >>> > Thanks in advance,
>> >>> >
>> >>> > Cheers,
>> >>> >
>> >>> > German
>> >>> >
>> >>> > 2015-11-23 13:32 GMT-03:00 Mark Nelson:
>> >>> >>
>> >>> >> Hi German,
>> >>> >>
>> >>> >> I don't have exactly the same setup, but on the Ceph community
>> >>> >> cluster I have tests with:
>> >>> >>
>> >>> >> 4 nodes, each of which is configured in some tests with:
>> >>> >>
>> >>> >> 2 x Intel Xeon E5-2650
>> >>> >> 1 x Intel XL710 40GbE (currently limited to about 2.5GB/s each)
>> >>> >> 1 x Intel P3700 800GB (4 OSDs per card using 4 data and 4 journal
>> >>> >>     partitions)
>> >>> >> 64GB RAM
>> >>> >>
>> >>> >> With filestore, I can get an aggregate throughput of:
>> >>> >>
>> >>> >> 1MB randread: 8715.3MB/s
>> >>> >> 4MB randread: 8046.2MB/s
>> >>> >>
>> >>> >> This is with 4 fio instances on the same nodes as the OSDs, using
>> >>> >> the fio librbd engine.
>> >>> >>
>> >>> >> A couple of things I would suggest trying:
>> >>> >>
>> >>> >> 1) See how rados bench does. This is an easy test and you can see
>> >>> >> how different the numbers look.
>> >>> >>
>> >>> >> 2) Try fio with librbd to see if it might be a qemu limitation.
>> >>> >>
>> >>> >> 3) Assuming you are using IPoIB, try some iperf tests to see how
>> >>> >> your network is doing.
>> >>> >>
>> >>> >> Mark
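Mark's second suggestion, running fio against librbd directly (taking qemu
and the kernel rbd device out of the path), looks roughly like this. A
sketch only: it assumes fio was built with rbd support, and the pool,
image, and client names are placeholders:

    # Requires an existing RBD image (e.g. "rbd create test-img --size 16384")
    fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test-img \
        --rw=randread --bs=1m --iodepth=32 --runtime=22 --time_based \
        --group_reporting --name=librbd-randread-1m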
>> >>> >>
>> >>> >> On 11/23/2015 10:17 AM, German Anders wrote:
>> >>> >>>
>> >>> >>> Thanks a lot for the quick update, Greg. This leads me to ask
>> >>> >>> whether there's anything out there to improve performance in an
>> >>> >>> InfiniBand environment with Ceph. In the cluster I mentioned
>> >>> >>> earlier, I've set up 4 OSD server nodes, each with 8 OSD daemons
>> >>> >>> running on 800GB Intel SSD DC S3510 disks (740.2G for the OSD and
>> >>> >>> 5G for the journal), also using IB FDR 56Gb/s for the PUB and
>> >>> >>> CLUS networks, and I'm getting the following fio numbers:
>> >>> >>>
>> >>> >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> >>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>> >>> >>> --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
>> >>> >>> --norandommap --group_reporting --exitall
>> >>> >>> --name dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> >>> >>> --filename=/mnt/rbd/test1
>> >>> >>>
>> >>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):
>> >>> >>> rw=randread, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>> >>> ...
>> >>> >>> fio-2.1.3
>> >>> >>> Starting 4 processes
>> >>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO
>> >>> >>> file(s) (1 file(s) / 16384MB)
>> >>> >>> Jobs: 4 (f=4): [rrrr] [33.8% done] [1082MB/0KB/0KB /s]
>> >>> >>> [1081/0/0 iops] [eta 00m:45s]
>> >>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0,
>> >>> >>> jobs=4): err= 0: pid=63852: Mon Nov 23 10:48:07 2015
>> >>> >>>   read : io=21899MB, bw=988.23MB/s, iops=988, runt= 22160msec
>> >>> >>>     slat (usec): min=192, max=186274, avg=3990.48, stdev=7533.77
>> >>> >>>     clat (usec): min=10, max=808610, avg=125099.41, stdev=90717.56
>> >>> >>>      lat (msec): min=6, max=809, avg=129.09, stdev=91.14
>> >>> >>>     clat percentiles (msec):
>> >>> >>>      |  1.00th=[  27],  5.00th=[  38], 10.00th=[  45], 20.00th=[  61],
>> >>> >>>      | 30.00th=[  74], 40.00th=[  85], 50.00th=[ 100], 60.00th=[ 117],
>> >>> >>>      | 70.00th=[ 141], 80.00th=[ 174], 90.00th=[ 235], 95.00th=[ 297],
>> >>> >>>      | 99.00th=[ 482], 99.50th=[ 578], 99.90th=[ 717], 99.95th=[ 750],
>> >>> >>>      | 99.99th=[ 775]
>> >>> >>>     bw (KB /s): min=134691, max=335872, per=25.08%, avg=253748.08,
>> >>> >>>     stdev=40454.88
>> >>> >>>     lat (usec) : 20=0.01%
>> >>> >>>     lat (msec) : 10=0.02%, 20=0.27%, 50=12.90%, 100=36.93%, 250=41.39%
>> >>> >>>     lat (msec) : 500=7.59%, 750=0.84%, 1000=0.05%
>> >>> >>>   cpu          : usr=0.11%, sys=26.76%, ctx=39695, majf=0, minf=405
>> >>> >>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>> >>> >>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>> >>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> >>> >>>      issued    : total=r=21899/w=0/d=0, short=r=0/w=0/d=0
>> >>> >>>
>> >>> >>> Run status group 0 (all jobs):
>> >>> >>>    READ: io=21899MB, aggrb=988.23MB/s, minb=988.23MB/s,
>> >>> >>>    maxb=988.23MB/s, mint=22160msec, maxt=22160msec
>> >>> >>>
>> >>> >>> Disk stats (read/write):
>> >>> >>>   rbd1: ios=43736/163, merge=0/5, ticks=3189484/15276,
>> >>> >>>   in_queue=3214988, util=99.78%
>> >>> >>>
>> >>> >>> ##################################################################
>> >>> >>>
>> >>> >>> # fio --rw=randread --bs=4m --numjobs=4 --iodepth=32 --runtime=22
>> >>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>> >>> >>> --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
>> >>> >>> --norandommap --group_reporting --exitall
>> >>> >>> --name dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec
>> >>> >>> --filename=/mnt/rbd/test2
>> >>> >>>
>> >>> >>> fio-2.1.3
>> >>> >>> Starting 4 processes
>> >>> >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: Laying out IO
>> >>> >>> file(s) (1 file(s) / 16384MB)
>> >>> >>> Jobs: 4 (f=4): [rrrr] [28.7% done] [894.3MB/0KB/0KB /s]
>> >>> >>> [223/0/0 iops] [eta 00m:57s]
>> >>> >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: (groupid=0,
>> >>> >>> jobs=4): err= 0: pid=64654: Mon Nov 23 10:51:58 2015
>> >>> >>>   read : io=18952MB, bw=876868KB/s, iops=214, runt= 22132msec
>> >>> >>>     slat (usec): min=518, max=81398, avg=18576.88, stdev=14840.55
>> >>> >>>     clat (msec): min=90, max=1915, avg=570.37, stdev=166.51
>> >>> >>>      lat (msec): min=123, max=1936, avg=588.95, stdev=169.19
>> >>> >>>     clat percentiles (msec):
>> >>> >>>      |  1.00th=[  258],  5.00th=[  343], 10.00th=[  383], 20.00th=[  437],
>> >>> >>>      | 30.00th=[  482], 40.00th=[  519], 50.00th=[  553], 60.00th=[  594],
>> >>> >>>      | 70.00th=[  627], 80.00th=[  685], 90.00th=[  775], 95.00th=[  865],
>> >>> >>>      | 99.00th=[ 1057], 99.50th=[ 1156], 99.90th=[ 1680], 99.95th=[ 1860],
>> >>> >>>      | 99.99th=[ 1909]
>> >>> >>>     bw (KB /s): min= 5665, max=383251, per=24.61%, avg=215755.74,
>> >>> >>>     stdev=61735.70
>> >>> >>>     lat (msec) : 100=0.02%, 250=0.80%, 500=33.88%, 750=53.31%,
>> >>> >>>     1000=10.26%
>> >>> >>>     lat (msec) : 2000=1.73%
>> >>> >>>   cpu          : usr=0.07%, sys=12.52%, ctx=32466, majf=0, minf=372
>> >>> >>>   IO depths    : 1=0.1%, 2=0.2%, 4=0.3%, 8=0.7%, 16=1.4%, 32=97.4%, >=64=0.0%
>> >>> >>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>> >>>      complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> >>> >>>      issued    : total=r=4738/w=0/d=0, short=r=0/w=0/d=0
>> >>> >>>
>> >>> >>> Run status group 0 (all jobs):
>> >>> >>>    READ: io=18952MB, aggrb=876868KB/s, minb=876868KB/s,
>> >>> >>>    maxb=876868KB/s, mint=22132msec, maxt=22132msec
>> >>> >>>
>> >>> >>> Disk stats (read/write):
>> >>> >>>   rbd1: ios=37721/177, merge=0/5, ticks=3075924/11408,
>> >>> >>>   in_queue=3097448, util=99.77%
>> >>> >>>
>> >>> >>> Can anyone share some results from a similar environment?
>> >>> >>>
>> >>> >>> Thanks in advance,
>> >>> >>>
>> >>> >>> Best,
>> >>> >>>
>> >>> >>> German
>> >>> >>>
>> >>> >>> 2015-11-23 13:08 GMT-03:00 Gregory Farnum:
>> >>> >>>
>> >>> >>> On Mon, Nov 23, 2015 at 10:05 AM, German Anders wrote:
>> >>> >>> > Hi all,
>> >>> >>> >
>> >>> >>> > I want to know if there's any improvement or update regarding
>> >>> >>> > Ceph 0.94.5 with Accelio. I have an already configured cluster
>> >>> >>> > (with no data on it), and I would like to know if there's a way
>> >>> >>> > to 'modify' the cluster in order to use Accelio. Any info would
>> >>> >>> > be really appreciated.
>> >>> >>>
>> >>> >>> The XioMessenger is still experimental. As far as I know it's not
>> >>> >>> expected to be stable any time soon, and I can't imagine it will
>> >>> >>> be backported to Hammer even when done.
>> >>> >>> -Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com