Re: Ceph 0.94.5 with accelio

German Anders <ganders@xxxxxxxxxxxx> · Tue, 24 Nov 2015 14:26:11 -0300

Thanks a lot, Robert, for the explanation. I understand what you are saying, and I'm also excited to see more about IB with Ceph to get those performance numbers up, and hopefully soon to see accelio working in production.

Regarding the HP IB switch, we have 4 ports (uplinks) connected to our IB switch, and internally the blades are connected through the backplane to two ports, so they use all the ports inside the enclosure switch (16 ports). The bonding that I've configured is active/backup; I didn't know that active/active was possible with IPoIB.

Also, the adapters on the Ceph nodes (Supermicro servers) are Mellanox Technologies MT27500 Family [ConnectX-3]. I also double-checked the port configuration on the IB switch: its speed is 14.0 Gbps (per lane), the supported MTU is 4096, and the current line rate is 56.0 Gbps.

I've tried almost all possible combinations and I'm not getting anything above 1.8 GB/s, so I was wondering whether this is the upper limit for this kind of setup.
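
For reference, the nominal data rates behind those figures can be worked out from the per-lane signalling rate and the line encoding. This is only a sketch, assuming the usual 4x link width and the published per-lane rates (10 Gb/s with 8b/10b for QDR, 10.3125 Gb/s with 64b/66b for FDR-10, 14.0625 Gb/s with 64b/66b for FDR). The table and function names are made up for illustration.

```python
# Nominal InfiniBand link data rates, before any IPoIB/TCP overhead.
# Each entry: (per-lane signalling rate in Gbit/s, encoding efficiency).
# The names below are illustrative, not taken from any tool.
LINK_TYPES = {
    "QDR":    (10.0,    8 / 10),   # 8b/10b encoding
    "FDR-10": (10.3125, 64 / 66),  # 64b/66b encoding
    "FDR":    (14.0625, 64 / 66),  # 64b/66b encoding
}

def data_rate_gbit(link_type, lanes=4):
    """Usable data rate in Gbit/s for a link of the given type and width."""
    lane_rate, efficiency = LINK_TYPES[link_type]
    return lane_rate * lanes * efficiency

def gbit_to_gbyte(gbit_per_s):
    """Convert Gbit/s to GB/s (decimal units, 8 bits per byte)."""
    return gbit_per_s / 8

if __name__ == "__main__":
    for name in LINK_TYPES:
        rate = data_rate_gbit(name)
        print(f"{name:7s} {rate:6.2f} Gbit/s  {gbit_to_gbyte(rate):5.2f} GB/s")
```

Under these assumptions a 4x FDR port carries about 54.5 Gbit/s of data (roughly 6.8 GB/s), so 1.8 GB/s is about a quarter of the nominal rate.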

Best,

German

2015-11-24 14:11 GMT-03:00 Robert LeBlanc <robert@xxxxxxxxxxxxx>:
-----BEGIN PGP SIGNED MESSAGE-----

Hash: SHA256

I've had wildly different iperf results based on the version of the

kernel, OFED and whether you are using datagram or connected mode as

well as the MTU. You really have to just try all the different options

to figure out what works the best.

Please also remember that you will not get iSER performance out of
Ceph at the moment (probably never), but the work being done will
help. Even if you get the network transport optimally tuned, unless
you have a massive Ceph cluster, you won't get the performance out
of the SSDs. I'm just as excited about Ceph on Infiniband, but I've
had to just chill out and let the devs do their work.

I've never had good experiences with active/active bonding on IPoIB.

For two blades in the same chassis, you should get non-blocking line

rate. For going out of the chassis, you will be limited by the number

of ports you connect to the upstream switch (that is why there is

usually the same number of uplink ports as there are blades so that

you can do non-blocking, however HP has been selling switches with

only half the uplinks making your oversubscription 2:1, it really

depends on what you actually need). Between QDR and FDR, you should

get QDR speed. Also be sure it is full FDR and not FDR-10 which is the

same signal rate as QDR but with the new 64/66 encoding, it won't give

you as much speed improvement as FDR and it can be difficult to tell

which your adapter has if you don't research it. We thought we bought

FDR cards only to find out later they were FDR-10.
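
The oversubscription point can be put into numbers. A rough sketch, assuming every port runs at the same speed and that all blades send out of the chassis at the same time; the 16 internal ports and 4 uplinks are the figures given for the enclosure switch at the top of this thread, and the function names are made up.

```python
from fractions import Fraction

def oversubscription(downlinks, uplinks):
    """Ratio of blade-facing ports to uplink ports, in lowest terms."""
    if uplinks <= 0:
        raise ValueError("need at least one uplink")
    return Fraction(downlinks, uplinks)

def worst_case_share(link_gbit, downlinks, uplinks):
    """Per-port bandwidth out of the chassis when every port sends at once."""
    ratio = float(oversubscription(downlinks, uplinks))
    # A ratio below 1 means more uplinks than downlinks: no oversubscription.
    return link_gbit / max(ratio, 1.0)

if __name__ == "__main__":
    ratio = oversubscription(16, 4)
    print(f"oversubscription {ratio.numerator}:{ratio.denominator}")
    # 32 Gbit/s is the nominal 4x QDR data rate.
    print(f"worst-case share: {worst_case_share(32.0, 16, 4):.1f} Gbit/s")
```

This only describes traffic leaving the chassis; traffic between two blades in the same chassis is not affected by the uplink count.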

----------------

Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Nov 24, 2015 at 8:24 AM, German Anders <ganders@xxxxxxxxxxxx> wrote:

> Another test, made between two HP blades with QDR (with bonding):

>

> e60-host01# iperf -s

> ------------------------------------------------------------

> Server listening on TCP port 5001

> TCP window size: 85.3 KByte (default)

> ------------------------------------------------------------

> [  5] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41807

> [  4] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41806

> [  6] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41808

> [  7] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41809

> [ ID] Interval       Transfer     Bandwidth

> [  5]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec

> [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec

> [  6]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec

> [  7]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec

> [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec

>

> e60-host02# iperf -c 172.23.18.2 -P 4

>

> ------------------------------------------------------------

> Client connecting to 172.23.18.2, TCP port 5001

> TCP window size: 2.50 MByte (default)

> ------------------------------------------------------------

> [  3] local 172.23.18.1 port 41806 connected with 172.23.18.2 port 5001

> [  5] local 172.23.18.1 port 41808 connected with 172.23.18.2 port 5001

> [  4] local 172.23.18.1 port 41807 connected with 172.23.18.2 port 5001

> [  6] local 172.23.18.1 port 41809 connected with 172.23.18.2 port 5001

> [ ID] Interval       Transfer     Bandwidth

> [  3]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec

> [  5]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec

> [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec

> [  6]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec

> [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec

>

> Note that the blades are also in the same enclosure.
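
To compare the iperf result with the fio and rados numbers elsewhere in the thread, it helps to put it in bytes per second. A small sketch; the per-stream values are copied from the run above, and the helper names are made up.

```python
# Per-stream results copied from the quoted iperf run, in Gbit/s.
STREAMS_GBIT = [2.27, 2.27, 3.08, 3.07]

def aggregate_gbit(streams):
    """Sum of the per-stream bandwidths in Gbit/s."""
    return sum(streams)

def gbit_to_gbyte(gbit_per_s):
    """Convert Gbit/s to GB/s (decimal units, 8 bits per byte)."""
    return gbit_per_s / 8

if __name__ == "__main__":
    total = aggregate_gbit(STREAMS_GBIT)
    print(f"aggregate: {total:.2f} Gbit/s = {gbit_to_gbyte(total):.2f} GB/s")
```

The sum is close to the 10.7 Gbits/sec on the [SUM] line, which is about 1.3 GB/s.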

>

> bonding configuration:

>

> alias bond-ib bonding options bonding mode=1 miimon=100 downdelay=100

> updelay=100 max_bonds=2

>

> ## INFINIBAND CONF

>

> auto ib0

> iface ib0 inet manual

>         bond-master bond-ib

>

> auto ib1

> iface ib1 inet manual

>         bond-master bond-ib

>

> auto bond-ib

> iface bond-ib inet static

>         address 172.23.xx.xx

>         netmask 255.255.xx.xx

>         slaves ib0 ib1

>         bond_miimon 100

>         bond_mode active-backup

>         pre-up echo connected > /sys/class/net/ib0/mode

>         pre-up echo connected > /sys/class/net/ib1/mode

>         pre-up /sbin/ifconfig ib0 mtu 65520

>         pre-up /sbin/ifconfig ib1 mtu 65520

>         pre-up modprobe bond-ib

>         pre-up /sbin/ifconfig bond-ib mtu 65520
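
A quick way to confirm that the pre-up lines above actually took effect is to read the same sysfs attributes back. A minimal sketch, assuming the interface names ib0 and ib1 from the config and the standard `mode` and `mtu` files under /sys/class/net; the function names are hypothetical.

```python
from pathlib import Path

def ipoib_settings(iface, sysfs_root="/sys/class/net"):
    """Return (mode, mtu) for an interface by reading its sysfs attributes."""
    base = Path(sysfs_root) / iface
    mode = (base / "mode").read_text().strip()
    mtu = int((base / "mtu").read_text().strip())
    return mode, mtu

def looks_tuned(mode, mtu, want_mode="connected", want_mtu=65520):
    """True when the interface matches the values used in the config above."""
    return mode == want_mode and mtu == want_mtu

if __name__ == "__main__":
    for iface in ("ib0", "ib1"):
        try:
            mode, mtu = ipoib_settings(iface)
        except OSError as exc:
            # Interface missing, or not an IPoIB device (no "mode" file).
            print(f"{iface}: cannot read settings ({exc})")
            continue
        print(f"{iface}: mode={mode} mtu={mtu} ok={looks_tuned(mode, mtu)}")
```

The `sysfs_root` parameter exists only so the helper can be pointed at a test directory.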

>

>

> German

>

> 2015-11-24 11:51 GMT-03:00 Mark Nelson <mnelson@xxxxxxxxxx>:

>>

>> Each port should be able to do 40Gb/s or 56Gb/s minus overhead and any

>> PCIe or card related bottlenecks.  IPoIB will further limit that, especially

>> if you haven't done any kind of interrupt affinity tuning.
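
On the PCIe side, the theoretical slot bandwidth can be compared against the link rate. A sketch using the published per-lane rates and encodings for PCIe 2.0 and 3.0; the x8 width in the example is an assumption for illustration (the negotiated width and speed show up as LnkSta in `lspci -vv`), and the function names are made up.

```python
# Theoretical PCIe bandwidth per direction, before protocol overhead.
# Each entry: (transfer rate in GT/s per lane, encoding efficiency).
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}

def pcie_gbit(generation, lanes):
    """Raw data rate of a PCIe slot in Gbit/s."""
    rate, efficiency = PCIE_GENERATIONS[generation]
    return rate * lanes * efficiency

def slot_limits_link(generation, lanes, link_gbit):
    """True when the slot is slower than the network link it feeds."""
    return pcie_gbit(generation, lanes) < link_gbit

if __name__ == "__main__":
    fdr_gbit = 14.0625 * 4 * 64 / 66   # nominal 4x FDR data rate
    for gen in (2, 3):
        slot = pcie_gbit(gen, 8)
        print(f"PCIe {gen}.0 x8: {slot:5.1f} Gbit/s, "
              f"limits FDR: {slot_limits_link(gen, 8, fdr_gbit)}")
```

Real throughput is lower than these figures because of PCIe packet overhead, so this is an upper bound only.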

>>

>> Assuming these are mellanox cards you'll want to read this guide:

>>

>>

>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

>>

>> For QDR I think the maximum throughput with IPoIB I've ever seen was about

>> 2.7GB/s for a single port.  Typically 2-2.5GB/s is probably about what you

>> should expect for a well tuned setup.

>>

>> I'd still suggest doing iperf tests.  It's really easy:

>>

>> "iperf -s" on one node to act as a server.

>>

>> "iperf -c <server ip> -P <num connections, ie: 4>" on the client

>>

>> This will give you an idea of how your network is doing.  All-To-All

>> network tests are also useful, in that sometimes network issues can crop up

>> only when there's lots of traffic across many ports.  We've seen this in lab

>> environments, especially with bonded ethernet.
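
For the all-to-all case, the list of client commands can be generated instead of typed by hand. A sketch that only prints the commands, using the same `iperf -c <server ip> -P <n>` form as above; it assumes `iperf -s` is already running on every host, and the third address in the example is hypothetical.

```python
from itertools import permutations

def all_to_all_commands(hosts, streams=4):
    """Yield (client_host, command) for every ordered pair of distinct hosts."""
    for client, server in permutations(hosts, 2):
        yield client, f"iperf -c {server} -P {streams}"

if __name__ == "__main__":
    # Example addresses; each host is assumed to be running "iperf -s".
    hosts = ["172.23.18.1", "172.23.18.2", "172.23.18.3"]
    for client, cmd in all_to_all_commands(hosts):
        print(f"on {client}: {cmd}")
```

The commands still have to be started at the same time on all clients for the test to exercise many ports at once.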

>>

>> Mark

>>

>> On 11/24/2015 07:22 AM, German Anders wrote:

>>>

>>> After doing some more in-depth research and tuning some parameters, I've
>>> gained a little more performance:

>>>

>>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22

>>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1

>>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap

>>> --group_reporting --exitall --name

>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec

>>> --filename=/mnt/e60host01vol1/test1

>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,

>>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32

>>> ...

>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,

>>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32

>>> fio-2.1.3

>>> Starting 4 processes

>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO file(s)

>>> (1 file(s) / 16384MB)

>>> Jobs: 4 (f=4): [rrrr] [60.5% done] [*1714MB*/0KB/0KB /s] [1713/0/0 iops] [eta 00m:15s]

>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):

>>> err= 0: pid=54857: Tue Nov 24 07:56:30 2015

>>>    read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062msec

>>>      slat (usec): min=131, max=63426, avg=2249.87, stdev=4320.91

>>>      clat (msec): min=2, max=321, avg=70.56, stdev=35.80

>>>       lat (msec): min=2, max=321, avg=72.81, stdev=36.13

>>>      clat percentiles (msec):

>>>       |  1.00th=[   13],  5.00th=[   24], 10.00th=[   30], 20.00th=[   40],
>>>       | 30.00th=[   50], 40.00th=[   57], 50.00th=[   65], 60.00th=[   75],
>>>       | 70.00th=[   85], 80.00th=[   98], 90.00th=[  120], 95.00th=[  139],
>>>       | 99.00th=[  178], 99.50th=[  194], 99.90th=[  229], 99.95th=[  247],
>>>       | 99.99th=[  273]

>>>      bw (KB  /s): min=301056, max=612352, per=25.01%, avg=449291.87,

>>> stdev=54288.85

>>>      lat (msec) : 4=0.11%, 10=0.61%, 20=2.11%, 50=27.87%, 100=50.92%

>>>      lat (msec) : 250=18.34%, 500=0.03%

>>>    cpu          : usr=0.19%, sys=33.60%, ctx=66708, majf=0, minf=636

>>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%,

>>>  >=64=0.0%

>>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,

>>>  >=64=0.0%

>>>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,

>>>  >=64=0.0%

>>>       issued    : total=r=38699/w=0/d=0, short=r=0/w=0/d=0

>>>

>>> Run status group 0 (all jobs):

>>>     READ: io=38699MB, aggrb=*1754.2MB/s*, minb=1754.2MB/s,
>>> maxb=1754.2MB/s, mint=22062msec, maxt=22062msec

>>>

>>> Disk stats (read/write):

>>>    rbd1: ios=77386/17, merge=0/122, ticks=3168312/500, in_queue=3170168,

>>> util=99.76%

>>>

>>> The thing is that this test was run from an 'HP Blade enclosure with
>>> QDR'. If the max throughput with QDR is around 3.2 GB/s, I guess that
>>> number must be divided by the total number of ports (2 in this case),
>>> so 1.6 GB/s is the most I'll get on a single port. Is that correct?
>>> I also ran another test on another host that has FDR (max throughput
>>> would be around 6.8 GB/s). If the same theory is valid, that would give
>>> me 3.4 GB/s per port, but I'm not getting more than 1.4 - 1.6 GB/s. Any
>>> ideas? The same tuning was applied on both servers.
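
The fio figure above can be cross-checked from its own io= and runt= fields and put in the same units as a link rate. A sketch, assuming fio 2.1.3 reports MB as 2^20 bytes; the 32 Gbit/s used in the example is the nominal 4x QDR data rate, and the function names are made up.

```python
MIB = 1024 * 1024

def fio_bandwidth_mib(io_mib, runtime_msec):
    """Average bandwidth in MiB/s from fio's io= and runt= fields."""
    return io_mib / (runtime_msec / 1000)

def mib_per_s_to_gbit(mib_per_s):
    """Convert MiB/s to Gbit/s (decimal gigabits)."""
    return mib_per_s * MIB * 8 / 1e9

def link_utilisation(mib_per_s, link_gbit):
    """Fraction of a link's nominal data rate used by the given throughput."""
    return mib_per_s_to_gbit(mib_per_s) / link_gbit

if __name__ == "__main__":
    bw = fio_bandwidth_mib(38699, 22062)   # values from the run above
    print(f"{bw:.1f} MiB/s = {mib_per_s_to_gbit(bw):.2f} Gbit/s")
    print(f"share of a 32 Gbit/s link: {link_utilisation(bw, 32.0):.0%}")
```

This says nothing about where the remaining bandwidth goes; it only makes the fio and link numbers comparable.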

>>>

>>> Basically I changed the scaling_governor of the cpufreq of all cpus to

>>> 'performance' and then set the following values:

>>>

>>> sysctl -w net.ipv4.tcp_timestamps=0

>>> sysctl -w net.core.netdev_max_backlog=250000

>>> sysctl -w net.core.rmem_max=4194304

>>> sysctl -w net.core.wmem_max=4194304

>>> sysctl -w net.core.rmem_default=4194304

>>> sysctl -w net.core.wmem_default=4194304

>>> sysctl -w net.core.optmem_max=4194304

>>> sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"

>>> sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"

>>> sysctl -w net.ipv4.tcp_low_latency=1
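
The 4194304-byte limits above cap how much data a single TCP connection can have in flight. By the bandwidth-delay product, one stream cannot exceed window / round-trip time. A sketch of that arithmetic; the round-trip times in the example are hypothetical, not measured on this cluster, and the kernel's effective window can be smaller than the configured maximum.

```python
def max_throughput_gbit(window_bytes, rtt_seconds):
    """Upper bound on one TCP stream in Gbit/s: window divided by RTT."""
    return window_bytes * 8 / rtt_seconds / 1e9

def window_needed(link_gbit, rtt_seconds):
    """Bytes in flight needed to keep a link of link_gbit Gbit/s full."""
    return link_gbit * 1e9 * rtt_seconds / 8

if __name__ == "__main__":
    window = 4194304   # the rmem/wmem maximum set above
    for rtt in (0.001, 0.0001):   # hypothetical round-trip times in seconds
        bound = max_throughput_gbit(window, rtt)
        print(f"rtt {rtt * 1000:.1f} ms: at most {bound:.1f} Gbit/s per stream")
    print(f"window for 56 Gbit/s at 0.1 ms: {window_needed(56, 0.0001):.0f} bytes")
```

Under these assumptions the 4 MiB limit by itself is unlikely to be the bottleneck for sub-millisecond round trips.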

>>>

>>>

>>> However, the HP blade doesn't have Intel CPUs like the other server,
>>> so this kind of 'tuning' can't be done there. I left it at the default and
>>> only changed the TCP networking part.
>>>
>>> Any comments or hints would be really appreciated.

>>>

>>> Thanks in advance,

>>>

>>> Best,

>>>

>>>

>>> German

>>> 2015-11-23 15:06 GMT-03:00 Robert LeBlanc <robert@xxxxxxxxxxxxx

>>> <mailto:robert@xxxxxxxxxxxxx>>:

>>>

>>>

>>>     -----BEGIN PGP SIGNED MESSAGE-----

>>>     Hash: SHA256

>>>

>>>     Are you using unconnected mode or connected mode? With connected mode

>>>     you can up your MTU to 64K which may help on the network side.

>>>     - ----------------

>>>     Robert LeBlanc

>>>     PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

>>>

>>>

>>>     On Mon, Nov 23, 2015 at 10:40 AM, German Anders  wrote:

>>>      > Hi Mark,

>>>      >

>>>      > Thanks a lot for the quick response. Regarding the numbers that you sent me,
>>>      > they look REALLY nice. I have the following setup:

>>>      >

>>>      > 4 OSD nodes:

>>>      >

>>>      > 2 x Intel Xeon E5-2650v2 @2.60Ghz
>>>      > 1 x Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
>>>      >     Dual-Port (1 for PUB and 1 for CLUS)
>>>      > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
>>>      > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the same
>>>      >     drive, so 1:1 relationship)
>>>      > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
>>>      > 128GB RAM

>>>      >

>>>      > [0:0:0:0]    disk    ATA      INTEL SSDSC2BA20 0110  /dev/sdc

>>>      > [0:0:1:0]    disk    ATA      INTEL SSDSC2BA20 0110  /dev/sdd

>>>      > [0:0:2:0]    disk    ATA      INTEL SSDSC2BA20 0110  /dev/sde

>>>      > [0:0:3:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdf

>>>      > [0:0:4:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdg

>>>      > [0:0:5:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdh

>>>      > [0:0:6:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdi

>>>      > [0:0:7:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdj

>>>      > [0:0:8:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdk

>>>      > [0:0:9:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdl

>>>      > [0:0:10:0]   disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdm

>>>      >

>>>      > sdf                                8:80   0 745.2G  0 disk
>>>      > |-sdf1                             8:81   0 740.2G  0 part /var/lib/ceph/osd/ceph-16
>>>      > `-sdf2                             8:82   0     5G  0 part
>>>      > sdg                                8:96   0 745.2G  0 disk
>>>      > |-sdg1                             8:97   0 740.2G  0 part /var/lib/ceph/osd/ceph-17
>>>      > `-sdg2                             8:98   0     5G  0 part
>>>      > sdh                                8:112  0 745.2G  0 disk
>>>      > |-sdh1                             8:113  0 740.2G  0 part /var/lib/ceph/osd/ceph-18
>>>      > `-sdh2                             8:114  0     5G  0 part
>>>      > sdi                                8:128  0 745.2G  0 disk
>>>      > |-sdi1                             8:129  0 740.2G  0 part /var/lib/ceph/osd/ceph-19
>>>      > `-sdi2                             8:130  0     5G  0 part
>>>      > sdj                                8:144  0 745.2G  0 disk
>>>      > |-sdj1                             8:145  0 740.2G  0 part /var/lib/ceph/osd/ceph-20
>>>      > `-sdj2                             8:146  0     5G  0 part
>>>      > sdk                                8:160  0 745.2G  0 disk
>>>      > |-sdk1                             8:161  0 740.2G  0 part /var/lib/ceph/osd/ceph-21
>>>      > `-sdk2                             8:162  0     5G  0 part
>>>      > sdl                                8:176  0 745.2G  0 disk
>>>      > |-sdl1                             8:177  0 740.2G  0 part /var/lib/ceph/osd/ceph-22
>>>      > `-sdl2                             8:178  0     5G  0 part
>>>      > sdm                                8:192  0 745.2G  0 disk
>>>      > |-sdm1                             8:193  0 740.2G  0 part /var/lib/ceph/osd/ceph-23
>>>      > `-sdm2                             8:194  0     5G  0 part

>>>      >

>>>      >

>>>      > $ rados bench -p rbd 20 write --no-cleanup -t 4

>>>      >  Maintaining 4 concurrent writes of 4194304 bytes for up to 20

>>>     seconds or 0

>>>      > objects

>>>      >  Object prefix: benchmark_data_cibm01_1409

>>>      >    sec Cur ops   started  finished  avg MB/s  cur MB/s   last lat    avg lat
>>>      >      0       0         0         0         0         0          -          0
>>>      >      1       4       121       117   467.894       468  0.0337203  0.0336809
>>>      >      2       4       244       240   479.895       492  0.0304306  0.0330524
>>>      >      3       4       372       368   490.559       512  0.0361914  0.0323822
>>>      >      4       4       491       487   486.899       476  0.0346544  0.0327169
>>>      >      5       4       587       583   466.302       384   0.110718  0.0342427
>>>      >      6       4       701       697   464.575       456  0.0324953  0.0343136
>>>      >      7       4       811       807   461.053       440  0.0400344  0.0345994
>>>      >      8       4       923       919   459.412       448  0.0255677  0.0345767
>>>      >      9       4      1032      1028   456.803       436  0.0309743  0.0349256
>>>      >     10       4      1119      1115   445.917       348   0.229508  0.0357856
>>>      >     11       4      1222      1218   442.826       412  0.0277902  0.0360635
>>>      >     12       4      1315      1311   436.919       372  0.0303377  0.0365673
>>>      >     13       4      1424      1420   436.842       436  0.0288001    0.03659
>>>      >     14       4      1524      1520   434.206       400  0.0360993  0.0367697
>>>      >     15       4      1632      1628   434.054       432  0.0296406  0.0366877
>>>      >     16       4      1740      1736   433.921       432  0.0310995  0.0367746
>>>      >     17       4      1836      1832    430.98       384  0.0250518  0.0370169
>>>      >     18       4      1941      1937   430.366       420   0.027502  0.0371341
>>>      >     19       4      2049      2045   430.448       432  0.0260257  0.0370807
>>>      > 2015-11-23 12:10:58.587087min lat: 0.0229266 max lat: 0.27063 avg lat: 0.0373936
>>>      >    sec Cur ops   started  finished  avg MB/s  cur MB/s   last lat    avg lat
>>>      >     20       4      2141      2137   427.322       368  0.0351276  0.0373936

>>>      >  Total time run:         20.186437

>>>      > Total writes made:      2141

>>>      > Write size:             4194304

>>>      > Bandwidth (MB/sec):     424.245

>>>      >

>>>      > Stddev Bandwidth:       102.136

>>>      > Max bandwidth (MB/sec): 512

>>>      > Min bandwidth (MB/sec): 0

>>>      > Average Latency:        0.0376536

>>>      > Stddev Latency:         0.032886

>>>      > Max latency:            0.27063

>>>      > Min latency:            0.0229266

>>>      >

>>>      >

>>>      > $ rados bench -p rbd 20 seq --no-cleanup -t 4

>>>      >    sec Cur ops   started  finished  avg MB/s  cur MB/s   last lat    avg lat
>>>      >      0       0         0         0         0         0          -          0
>>>      >      1       4       394       390   1559.52      1560  0.0148888  0.0102236
>>>      >      2       4       753       749   1496.68      1436  0.0129162  0.0106595
>>>      >      3       4      1137      1133   1509.65      1536  0.0101854  0.0105731
>>>      >      4       4      1526      1522   1521.17      1556  0.0122154  0.0103827
>>>      >      5       4      1890      1886   1508.07      1456 0.00825445  0.0105908

>>>      >  Total time run:        5.675418

>>>      > Total reads made:     2141

>>>      > Read size:            4194304

>>>      > Bandwidth (MB/sec):    1508.964

>>>      >

>>>      > Average Latency:       0.0105951

>>>      > Max latency:           0.211469

>>>      > Min latency:           0.00603694
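
The bandwidth lines in both rados bench summaries follow directly from the op count, object size and run time. A sketch of that check, using 1 MB = 2^20 bytes (which is what makes the reported figures match); the function name is made up.

```python
def rados_bandwidth(ops, object_bytes, seconds):
    """Average bandwidth in MB/s, counting 1 MB as 2**20 bytes."""
    return ops * object_bytes / (1024 * 1024) / seconds

# Totals copied from the write and seq runs quoted above.
WRITE = rados_bandwidth(2141, 4194304, 20.186437)
SEQ_READ = rados_bandwidth(2141, 4194304, 5.675418)

if __name__ == "__main__":
    print(f"write:    {WRITE:.3f} MB/s")
    print(f"seq read: {SEQ_READ:.3f} MB/s")
```

Both values agree with the "Bandwidth (MB/sec)" lines to within rounding.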

>>>      >

>>>      >

>>>      > I'm not even close to the numbers that you are getting... :( Any ideas or
>>>      > hints? I've also configured NOOP as the scheduler for all the SSD disks. I
>>>      > don't really know what else to look at in order to improve performance and
>>>      > get numbers similar to yours.

>>>      >

>>>      >

>>>      > Thanks in advance,

>>>      >

>>>      > Cheers,

>>>      >

>>>      >

>>>      > German

>>>      >

>>>      > 2015-11-23 13:32 GMT-03:00 Mark Nelson :

>>>      >>

>>>      >> Hi German,

>>>      >>

>>>      >> I don't have exactly the same setup, but on the ceph community cluster I
>>>      >> have tests with:
>>>      >>
>>>      >> 4 nodes, each of which is configured in some tests with:

>>>      >>

>>>      >> 2 x Intel Xeon E5-2650

>>>      >> 1 x Intel XL710 40GbE (currently limited to about 2.5GB/s each)

>>>      >> 1 x Intel P3700 800GB (4 OSDs per card using 4 data and 4 journal

>>>      >> partitions)

>>>      >> 64GB RAM

>>>      >>

>>>      >> With filestore, I can get an aggregate throughput of:

>>>      >>

>>>      >> 1MB randread: 8715.3MB/s

>>>      >> 4MB randread: 8046.2MB/s

>>>      >>

>>>      >> This is with 4 fio instances on the same nodes as the OSDs using the fio
>>>      >> librbd engine.
>>>      >>
>>>      >> A couple of things I would suggest trying:
>>>      >>
>>>      >> 1) See how rados bench does.  This is an easy test and you can see how
>>>      >> different the numbers look.
>>>      >>
>>>      >> 2) Try fio with librbd to see if it might be a qemu limitation.
>>>      >>
>>>      >> 3) Assuming you are using IPoIB, try some iperf tests to see how your
>>>      >> network is doing.

>>>      >>

>>>      >> Mark

>>>      >>

>>>      >>

>>>      >> On 11/23/2015 10:17 AM, German Anders wrote:

>>>      >>>

>>>      >>> Thanks a lot for the quick update, Greg. This leads me to ask if there's
>>>      >>> anything out there to improve performance in an Infiniband environment
>>>      >>> with Ceph. In the cluster that I mentioned earlier, I've set up 4 OSD
>>>      >>> server nodes, each with 8 OSD daemons running on 800GB Intel SSD
>>>      >>> DC S3710 disks (740.2G for OSD and 5G for journal), using IB FDR
>>>      >>> 56Gb/s for the PUB and CLUS networks, and I'm getting the following fio
>>>      >>> numbers:

>>>      >>>

>>>      >>>

>>>      >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32

>>> --runtime=22

>>>      >>> --time_based --size=16777216k --loops=1 --ioengine=libaio

>>>     --direct=1

>>>      >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap

>>>      >>> --group_reporting --exitall --name

>>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec

>>>      >>> --filename=/mnt/rbd/test1

>>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):

>>>     rw=randread,

>>>      >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32

>>>      >>> ...

>>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):

>>>     rw=randread,

>>>      >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32

>>>      >>> fio-2.1.3

>>>      >>> Starting 4 processes

>>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO

>>>     file(s)

>>>      >>> (1 file(s) / 16384MB)

>>>      >>> Jobs: 4 (f=4): [rrrr] [33.8% done] [1082MB/0KB/0KB /s]

>>>     [1081/0/0 iops]

>>>      >>> [eta 00m:45s]

>>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0,

>>>     jobs=4):

>>>      >>> err= 0: pid=63852: Mon Nov 23 10:48:07 2015

>>>      >>>    read : io=21899MB, bw=988.23MB/s, iops=988, runt= 22160msec

>>>      >>>      slat (usec): min=192, max=186274, avg=3990.48,

>>> stdev=7533.77

>>>      >>>      clat (usec): min=10, max=808610, avg=125099.41,

>>> stdev=90717.56

>>>      >>>       lat (msec): min=6, max=809, avg=129.09, stdev=91.14

>>>      >>>      clat percentiles (msec):

>>>      >>>       |  1.00th=[   27],  5.00th=[   38], 10.00th=[   45], 20.00th=[   61],
>>>      >>>       | 30.00th=[   74], 40.00th=[   85], 50.00th=[  100], 60.00th=[  117],
>>>      >>>       | 70.00th=[  141], 80.00th=[  174], 90.00th=[  235], 95.00th=[  297],
>>>      >>>       | 99.00th=[  482], 99.50th=[  578], 99.90th=[  717], 99.95th=[  750],
>>>      >>>       | 99.99th=[  775]

>>>      >>>      bw (KB  /s): min=134691, max=335872, per=25.08%,

>>>     avg=253748.08,

>>>      >>> stdev=40454.88

>>>      >>>      lat (usec) : 20=0.01%

>>>      >>>      lat (msec) : 10=0.02%, 20=0.27%, 50=12.90%, 100=36.93%,

>>>     250=41.39%

>>>      >>>      lat (msec) : 500=7.59%, 750=0.84%, 1000=0.05%

>>>      >>>    cpu          : usr=0.11%, sys=26.76%, ctx=39695, majf=0,

>>>     minf=405

>>>      >>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%,

>>>     32=99.4%,

>>>      >>>  >=64=0.0%

>>>      >>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,

>>>     64=0.0%,

>>>      >>>  >=64=0.0%

>>>      >>>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%,

>>>     64=0.0%,

>>>      >>>  >=64=0.0%

>>>      >>>       issued    : total=r=21899/w=0/d=0, short=r=0/w=0/d=0

>>>      >>>

>>>      >>> Run status group 0 (all jobs):

>>>      >>>     READ: io=21899MB, aggrb=988.23MB/s, minb=988.23MB/s,

>>>      >>> maxb=988.23MB/s, mint=22160msec, maxt=22160msec

>>>      >>>

>>>      >>> Disk stats (read/write):

>>>      >>>    rbd1: ios=43736/163, merge=0/5, ticks=3189484/15276,

>>>      >>> in_queue=3214988, util=99.78%

>>>      >>>

>>>      >>>

>>>      >>>

>>>      >>>

>>>

>>> ############################################################################################################################################################

>>>      >>>

>>>      >>>

>>>      >>> # fio --rw=randread --bs=4m --numjobs=4 --iodepth=32

>>> --runtime=22

>>>      >>> --time_based --size=16777216k --loops=1 --ioengine=libaio

>>>     --direct=1

>>>      >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap

>>>      >>> --group_reporting --exitall --name

>>>      >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec

>>>      >>> --filename=/mnt/rbd/test2

>>>      >>>

>>>      >>> fio-2.1.3

>>>      >>> Starting 4 processes

>>>      >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: Laying out IO

>>>     file(s)

>>>      >>> (1 file(s) / 16384MB)

>>>      >>> Jobs: 4 (f=4): [rrrr] [28.7% done] [894.3MB/0KB/0KB /s]

>>>     [223/0/0 iops]

>>>      >>> [eta 00m:57s]

>>>      >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: (groupid=0,

>>>     jobs=4):

>>>      >>> err= 0: pid=64654: Mon Nov 23 10:51:58 2015

>>>      >>>    read : io=18952MB, bw=876868KB/s, iops=214, runt= 22132msec

>>>      >>>      slat (usec): min=518, max=81398, avg=18576.88,

>>> stdev=14840.55

>>>      >>>      clat (msec): min=90, max=1915, avg=570.37, stdev=166.51

>>>      >>>       lat (msec): min=123, max=1936, avg=588.95, stdev=169.19

>>>      >>>      clat percentiles (msec):

>>>      >>>       |  1.00th=[  258],  5.00th=[  343], 10.00th=[  383], 20.00th=[  437],
>>>      >>>       | 30.00th=[  482], 40.00th=[  519], 50.00th=[  553], 60.00th=[  594],
>>>      >>>       | 70.00th=[  627], 80.00th=[  685], 90.00th=[  775], 95.00th=[  865],
>>>      >>>       | 99.00th=[ 1057], 99.50th=[ 1156], 99.90th=[ 1680], 99.95th=[ 1860],
>>>      >>>       | 99.99th=[ 1909]

>>>      >>>      bw (KB  /s): min= 5665, max=383251, per=24.61%,

>>> avg=215755.74,

>>>      >>> stdev=61735.70

>>>      >>>      lat (msec) : 100=0.02%, 250=0.80%, 500=33.88%, 750=53.31%,

>>>      >>> 1000=10.26%

>>>      >>>      lat (msec) : 2000=1.73%

>>>      >>>    cpu          : usr=0.07%, sys=12.52%, ctx=32466, majf=0,

>>>     minf=372

>>>      >>>    IO depths    : 1=0.1%, 2=0.2%, 4=0.3%, 8=0.7%, 16=1.4%,

>>>     32=97.4%,

>>>      >>>  >=64=0.0%

>>>      >>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,

>>>     64=0.0%,

>>>      >>>  >=64=0.0%

>>>      >>>       complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%,

>>>     64=0.0%,

>>>      >>>  >=64=0.0%

>>>      >>>       issued    : total=r=4738/w=0/d=0, short=r=0/w=0/d=0

>>>      >>>

>>>      >>> Run status group 0 (all jobs):

>>>      >>>     READ: io=18952MB, aggrb=876868KB/s, minb=876868KB/s,

>>>      >>> maxb=876868KB/s, mint=22132msec, maxt=22132msec

>>>      >>>

>>>      >>> Disk stats (read/write):

>>>      >>>    rbd1: ios=37721/177, merge=0/5, ticks=3075924/11408,

>>>      >>> in_queue=3097448, util=99.77%

>>>      >>>

>>>      >>>

>>>      >>> Can anyone share some results from a similar environment?

>>>      >>>

>>>      >>> Thanks in advance,

>>>      >>>

>>>      >>> Best,

>>>      >>>

>>>      >>> German

>>>      >>>

>>>      >>> 2015-11-23 13:08 GMT-03:00 Gregory Farnum:
>>>      >>>
>>>      >>>     On Mon, Nov 23, 2015 at 10:05 AM, German Anders wrote:

>>>      >>>     > Hi all,

>>>      >>>     >

>>>      >>>     > I want to know if there's any improvement or update regarding ceph 0.94.5
>>>      >>>     > with accelio. I have an already configured cluster (with no data on it) and I
>>>      >>>     > would like to know if there's a way to 'modify' the cluster in order to use
>>>      >>>     > accelio. Any info would be really appreciated.

>>>      >>>

>>>      >>>     The XioMessenger is still experimental. As far as I know it's not
>>>      >>>     expected to be stable any time soon and I can't imagine it will be
>>>      >>>     backported to Hammer even when done.

>>>      >>>     -Greg

>>>      >>>

>>>      >>>

>>>      >>>

>>>      >>>

>>>      >>> _______________________________________________

>>>      >>> ceph-users mailing list

>>>      >>> ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>

>>>      >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>      >>>

>>>      >

>>>      >

>>>      >

>>>      >

>>>

>>>

>>>

>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com