Re: Ceph 0.94.5 with accelio

I'll try setting the ports on the HP IB QDR switch to 4K, then configure the interfaces to MTU 4096 as well, run the same tests again, and see what the results are. However, is there any other parameter I need to take into account and tune for this? For example, this is the port configuration of one of the blade hosts:

$ ibportstate -L 29 17 query
Switch PortInfo:
# Port info: Lid 29 port 17
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................75
SMLid:...........................2328
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 29 DR path slid 4; dlid 65535; 0,17 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................32
SMLid:...........................2
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
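
One fabric-side parameter worth checking, assuming the subnet manager is OpenSM rather than an embedded SM on the switch: the IPoIB broadcast group MTU is capped by the partition definition, so getting a 4K path MTU usually also means raising the partition MTU in /etc/opensm/partitions.conf and restarting opensm. A minimal sketch (the mtu=5 code maps to 4096):

Default=0x7fff, ipoib, mtu=5 : ALL=full;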

OK, changing the MTU on the hosts from 65520 to 4096 drops performance really badly, from 1.7 GB/s to 143.8 MB/s... I'll keep looking into this.
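
One thing worth ruling out first: the 65520 MTU is only available in IPoIB connected mode, so if the MTU change also flipped the interfaces back into datagram mode, that alone could explain the drop. A quick check, assuming the standard IPoIB sysfs entries (ib0/ib1 as in the bonding config quoted below):

cat /sys/class/net/ib0/mode /sys/class/net/ib1/mode   # should print "connected"
echo connected > /sys/class/net/ib0/mode              # restore connected mode if needed
ip link set dev ib0 mtu 4096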

Best,



German

2015-11-24 14:35 GMT-03:00 Robert LeBlanc <robert@xxxxxxxxxxxxx>:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I've gotten about 3.2 GB/s with IPoIB on QDR, but it took a couple of
weeks of tuning to get that rate. If your switch is at a 2048 MTU, it is
really hard to get it increased without an outage, if I remember
correctly. Connected mode makes it much easier to get higher MTUs, but it
was a bit flaky with IPoIB (I had to send several pings to get the
connection established sometimes). This was all a couple of years ago
now, so my memory is a bit fuzzy. My current IB Ceph cluster is so
small that doing any tuning is not going to help, because the
bottleneck is my disks and CPU.
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Nov 24, 2015 at 10:26 AM, German Anders  wrote:
> Thanks a lot, Robert, for the explanation. I understand what you are saying,
> and I'm also excited to see more work on IB with Ceph to get those performance
> numbers up, and hopefully (hopefully soon) to see accelio working in
> production. Regarding the HP IB switch, we have 4 ports (uplinks) connected to
> our IB switch, and internally the blades are connected through the backplane to
> two ports, so they use the total number of ports inside the enclosure switch (16
> ports). The bonding that I've configured is active/backup; I didn't know
> that active/active is possible with IPoIB. Also, the adapters that we have on
> the Ceph nodes (Supermicro servers) are Mellanox Technologies MT27500
> Family [ConnectX-3]. I also double-checked the port type configuration on the
> IB switch and see that its speed rate is 14.0 Gbps, the MTU
> supported is 4096, and the current line rate is 56.0 Gbps.
>
> I've tried almost all possible combinations and I'm not getting any
> improvement beyond 1.8 GB/s, so I was wondering if this is the top
> speed limit with this kind of setup.
>
> Best,
>
>
> German
>
> 2015-11-24 14:11 GMT-03:00 Robert LeBlanc :
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I've had wildly different iperf results based on the version of the
>> kernel, the OFED stack, whether you are using datagram or connected mode,
>> and the MTU. You really have to just try all the different combinations
>> to figure out what works best.
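>>
>> To make that concrete, a rough sweep along those lines might look like the
>> sketch below (assuming the IPoIB sysfs interface and the addresses used in
>> this thread; the MTU values are only examples):
>>
>> for mode in datagram connected; do
>>   echo $mode > /sys/class/net/ib0/mode
>>   for mtu in 2044 4092 65520; do
>>     ip link set dev ib0 mtu $mtu 2>/dev/null || continue   # skip MTUs the mode can't do
>>     echo "== $mode, MTU $mtu =="
>>     iperf -c 172.23.18.2 -P 4 -t 10 | tail -1
>>   done
>> done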
>>
>> Please also remember that you will not get iSER performance out of
>> Ceph at the moment (probably never), but the work being done will
>> help. Even if you get the network transport optimally tuned, unless
>> you have a massive Ceph cluster, you won't get the full performance out
>> of the SSDs. I'm just as excited about Ceph on InfiniBand, but I've
>> had to just chill out and let the devs do their work.
>>
>> I've never had good experiences with active/active bonding on IPoIB.
>> For two blades in the same chassis, you should get non-blocking line
>> rate. When going out of the chassis, you will be limited by the number
>> of ports you connect to the upstream switch (that is why there is
>> usually the same number of uplink ports as there are blades, so that
>> you can stay non-blocking; however, HP has been selling switches with
>> only half the uplinks, making your oversubscription 2:1, so it really
>> depends on what you actually need). Between QDR and FDR, you should
>> get QDR speed. Also be sure it is full FDR and not FDR-10, which runs at the
>> same signal rate as QDR but with the newer 64/66 encoding; it won't give
>> you as much speed improvement as FDR, and it can be difficult to tell
>> which your adapter has if you don't research it. We thought we bought
>> FDR cards only to find out later they were FDR-10.
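>>
>> (One way to check without opening the chassis, assuming the infiniband-diags
>> tools are installed: ibstat reports the negotiated rate per port, and an
>> FDR-10 link typically shows up as 40 Gb/s where full FDR shows 56 Gb/s.)
>>
>> ibstat | grep -E 'CA|Port|Rate|State'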
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWVJpCCRDmVDuy+mK58QAAEX4P/jFvdBzNob2xdftEkD2K
>> rSB5i/Idmi7BAe1/JUzMF/t7l7zFXEpq96oLbt5NMbreOhCe6MitEApfhpWq
>> dmt3IZYyUYVvXCxNGE/U7L58wi9DGPKJTWsigKScFtqjcQkIOlCh2VAHCmnE
>> /WZBtlMnBsoibqq+zZsM4GEBwvPCwUwpGDKU13DhpuvmiN09jICEHH05wZzq
>> ig/Ia309ioAZJ8PEKZ61kHUxAzTIMhwe1LV2jtlGQcJB4jMq7TQzOyizq0mQ
>> 7DJTNNkMVpB9IEBCuOzzs/ByjKz+Tu31Jw2Y8R9MjtoDpOo+WQzzn6W4+NS0
>> jG0cFiumIBKVwoMJyXpQeS6UC0w7balHaXy+8F4SUa+J/9X5w4bH9MmlJBfh
>> p81YDtNs7mQYKsuDOkjNe0BkthhHbdQThHn4A75j8Hqaltwr28UqL83ywCUJ
>> SqTGkhRLyU9O74snPfG+T7hM4fIVpH7DS4ebmK7yvSVzwwuExPgwWhjvAsmt
>> DRnXv0qd8UAIgza0VYTyZuElUC4V39wMe503tXo5By+NGKWzVNOWR1X0+46i
>> Xq2zvZQzc9MPtGHMmnm1dkJ+d6imfLzTf099njZ+Wl1xbagnQiKbiwKL8T/k
>> d3OClf514rV4i7FtwOoB8NQcUMUjaeZGmPVDhmVt7fRYz/+rARkN/jwXH4qG
>> x/Dk
>> =/88f
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Nov 24, 2015 at 8:24 AM, German Anders wrote:
>> > Another test made between two HP blades with QDR (with bonding):
>> >
>> > e60-host01# iperf -s
>> > ------------------------------------------------------------
>> > Server listening on TCP port 5001
>> > TCP window size: 85.3 KByte (default)
>> > ------------------------------------------------------------
>> > [  5] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41807
>> > [  4] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41806
>> > [  6] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41808
>> > [  7] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41809
>> > [ ID] Interval       Transfer     Bandwidth
>> > [  5]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  6]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
>> > [  7]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
>> > [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec
>> >
>> > e60-host02# iperf -c 172.23.18.2 -P 4
>> >
>> > ------------------------------------------------------------
>> > Client connecting to 172.23.18.2, TCP port 5001
>> > TCP window size: 2.50 MByte (default)
>> > ------------------------------------------------------------
>> > [  3] local 172.23.18.1 port 41806 connected with 172.23.18.2 port 5001
>> > [  5] local 172.23.18.1 port 41808 connected with 172.23.18.2 port 5001
>> > [  4] local 172.23.18.1 port 41807 connected with 172.23.18.2 port 5001
>> > [  6] local 172.23.18.1 port 41809 connected with 172.23.18.2 port 5001
>> > [ ID] Interval       Transfer     Bandwidth
>> > [  3]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  5]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
>> > [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  6]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
>> > [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec
>> >
>> > Note that the blades are also in the same enclosure.
>> >
>> > bonding configuration:
>> >
>> > alias bond-ib bonding
>> > options bonding mode=1 miimon=100 downdelay=100 updelay=100 max_bonds=2
>> >
>> > ## INFINIBAND CONF
>> >
>> > auto ib0
>> > iface ib0 inet manual
>> >         bond-master bond-ib
>> >
>> > auto ib1
>> > iface ib1 inet manual
>> >         bond-master bond-ib
>> >
>> > auto bond-ib
>> > iface bond-ib inet static
>> >         address 172.23.xx.xx
>> >         netmask 255.255.xx.xx
>> >         slaves ib0 ib1
>> >         bond_miimon 100
>> >         bond_mode active-backup
>> >         pre-up echo connected > /sys/class/net/ib0/mode
>> >         pre-up echo connected > /sys/class/net/ib1/mode
>> >         pre-up /sbin/ifconfig ib0 mtu 65520
>> >         pre-up /sbin/ifconfig ib1 mtu 65520
>> >         pre-up modprobe bond-ib
>> >         pre-up /sbin/ifconfig bond-ib mtu 65520
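>> >
>> > (Side note: with mode=1 / active-backup only one of ib0/ib1 carries traffic
>> > at a time, so the bond doesn't add bandwidth. A quick way to confirm which
>> > slave is active, assuming the standard bonding proc interface:)
>> >
>> > grep -E 'Bonding Mode|Currently Active Slave|MII Status' /proc/net/bonding/bond-ib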
>> >
>> >
>> > German
>> >
>> > 2015-11-24 11:51 GMT-03:00 Mark Nelson :
>> >>
>> >> Each port should be able to do 40 Gb/s or 56 Gb/s minus overhead and any
>> >> PCIe or card-related bottlenecks.  IPoIB will further limit that,
>> >> especially if you haven't done any kind of interrupt affinity tuning.
>> >>
>> >> Assuming these are mellanox cards you'll want to read this guide:
>> >>
>> >>
>> >>
>> >> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
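>> >>
>> >> As a rough illustration of the interrupt-affinity part (not taken from that
>> >> guide, just a hand-rolled sketch assuming the adapter's IRQs contain
>> >> "mlx4"; Mellanox OFED also ships helper scripts that do this properly),
>> >> pin the HCA interrupts to the cores on the NIC's NUMA node:
>> >>
>> >> # pin all mlx4 interrupts to CPUs 0-7 (adjust the mask to the NIC's NUMA node)
>> >> for irq in $(awk '/mlx4/ {sub(":","",$1); print $1}' /proc/interrupts); do
>> >>     echo ff > /proc/irq/$irq/smp_affinity
>> >> done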
>> >>
>> >> For QDR I think the maximum throughput with IPoIB I've ever seen was
>> >> about
>> >> 2.7GB/s for a single port.  Typically 2-2.5GB/s is probably about what
>> >> you
>> >> should expect for a well tuned setup.
>> >>
>> >> I'd still suggest doing iperf tests.  It's really easy:
>> >>
>> >> "iperf -s" on one node to act as a server.
>> >>
>> >> "iperf -c <server address> -P <number of parallel streams>" on the client
>> >>
>> >> This will give you an idea of how your network is doing.  All-To-All
>> >> network tests are also useful, in that sometimes network issues can
>> >> crop up
>> >> only when there's lots of traffic across many ports.  We've seen this
>> >> in lab
>> >> environments, especially with bonded ethernet.
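>> >>
>> >> A bare-bones way to approximate that kind of all-to-all run, assuming
>> >> passwordless SSH between the nodes and that HOSTS is filled in with the
>> >> real addresses (this is only a sketch, not a polished tool):
>> >>
>> >> HOSTS="172.23.18.1 172.23.18.2"   # replace with all node IPs
>> >> for h in $HOSTS; do ssh $h "pkill -f 'iperf -s'; iperf -s -D"; done
>> >> for cli in $HOSTS; do
>> >>   for srv in $HOSTS; do
>> >>     [ "$cli" = "$srv" ] && continue
>> >>     # fire all pairs at once so traffic crosses many ports simultaneously
>> >>     ssh $cli "iperf -c $srv -P 2 -t 20" > iperf_${cli}_to_${srv}.log 2>&1 &
>> >>   done
>> >> done
>> >> wait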
>> >>
>> >> Mark
>> >>
>> >> On 11/24/2015 07:22 AM, German Anders wrote:
>> >>>
>> >>> After doing some more in-depth research and tuning some parameters, I've
>> >>> gained a little bit more performance:
>> >>>
>> >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
>> >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>> >>> --group_reporting --exitall --name
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> >>> --filename=/mnt/e60host01vol1/test1
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>> ...
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>> fio-2.1.3
>> >>> Starting 4 processes
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO
>> >>> file(s)
>> >>> (1 file(s) / 16384MB)
>> >>> Jobs: 4 (f=4): [rrrr] [60.5% done] [1714MB/0KB/0KB /s] [1713/0/0
>> >>> iops]
>> >>>
>> >>> [eta 00m:15s]
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
>> >>> err= 0: pid=54857: Tue Nov 24 07:56:30 2015
>> >>>    read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062msec
>> >>>      slat (usec): min=131, max=63426, avg=2249.87, stdev=4320.91
>> >>>      clat (msec): min=2, max=321, avg=70.56, stdev=35.80
>> >>>       lat (msec): min=2, max=321, avg=72.81, stdev=36.13
>> >>>      clat percentiles (msec):
>> >>>       |  1.00th=[   13],  5.00th=[   24], 10.00th=[   30], 20.00th=[
>> >>> 40],
>> >>>       | 30.00th=[   50], 40.00th=[   57], 50.00th=[   65], 60.00th=[
>> >>> 75],
>> >>>       | 70.00th=[   85], 80.00th=[   98], 90.00th=[  120], 95.00th=[
>> >>> 139],
>> >>>       | 99.00th=[  178], 99.50th=[  194], 99.90th=[  229], 99.95th=[
>> >>> 247],
>> >>>       | 99.99th=[  273]
>> >>>      bw (KB  /s): min=301056, max=612352, per=25.01%, avg=449291.87,
>> >>> stdev=54288.85
>> >>>      lat (msec) : 4=0.11%, 10=0.61%, 20=2.11%, 50=27.87%, 100=50.92%
>> >>>      lat (msec) : 250=18.34%, 500=0.03%
>> >>>    cpu          : usr=0.19%, sys=33.60%, ctx=66708, majf=0, minf=636
>> >>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%,
>> >>>  >=64=0.0%
>> >>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >>>  >=64=0.0%
>> >>>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>> >>>  >=64=0.0%
>> >>>       issued    : total=r=38699/w=0/d=0, short=r=0/w=0/d=0
>> >>>
>> >>> Run status group 0 (all jobs):
>> >>>     READ: io=38699MB, aggrb=1754.2MB/s, minb=1754.2MB/s,
>> >>>
>> >>> maxb=1754.2MB/s, mint=22062msec, maxt=22062msec
>> >>>
>> >>> Disk stats (read/write):
>> >>>    rbd1: ios=77386/17, merge=0/122, ticks=3168312/500,
>> >>> in_queue=3170168,
>> >>> util=99.76%
>> >>>
>> >>> The thing is that this test was run from an HP Blade enclosure with
>> >>> QDR, so if on QDR the max throughput is around 3.2 GB/s, I guess that
>> >>> number must be divided by the total number of ports, in this case 2,
>> >>> so a maximum of 1.6 GB/s is the most throughput I'll get on a single
>> >>> port; is that correct? Also, I made another test on a host that has
>> >>> FDR (so max throughput would be around 6.8 GB/s), and if the same
>> >>> theory holds, that would lead me to 3.4 GB/s per port, but I'm not
>> >>> getting more than 1.4-1.6 GB/s. Any ideas? Same tuning on both
>> >>> servers.
>> >>>
>> >>> Basically, I changed the cpufreq scaling_governor of all CPUs to
>> >>> 'performance' and then set the following values:
>> >>>
>> >>> sysctl -w net.ipv4.tcp_timestamps=0
>> >>> sysctl -w net.core.netdev_max_backlog=250000
>> >>> sysctl -w net.core.rmem_max=4194304
>> >>> sysctl -w net.core.wmem_max=4194304
>> >>> sysctl -w net.core.rmem_default=4194304
>> >>> sysctl -w net.core.wmem_default=4194304
>> >>> sysctl -w net.core.optmem_max=4194304
>> >>> sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
>> >>> sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
>> >>> sysctl -w net.ipv4.tcp_low_latency=1
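>> >>>
>> >>> (For what it's worth, a sketch of making both changes persistent, assuming
>> >>> the sysfs cpufreq interface and a standard /etc/sysctl.d layout; the file
>> >>> name is just an example, and the remaining keys from the list above go in
>> >>> the same file:)
>> >>>
>> >>> for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
>> >>>     echo performance > $g
>> >>> done
>> >>> cat > /etc/sysctl.d/90-ipoib-tuning.conf <<'EOF'
>> >>> net.ipv4.tcp_timestamps = 0
>> >>> net.core.netdev_max_backlog = 250000
>> >>> EOF
>> >>> sysctl -p /etc/sysctl.d/90-ipoib-tuning.conf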
>> >>>
>> >>>
>> >>> However, the HP blade doesn't have Intel CPUs like the other server,
>> >>> so that kind of 'tuning' can't be done there; I left it at the defaults
>> >>> and only changed the TCP networking part.
>> >>>
>> >>> Any comments or hints would be really appreciated.
>> >>>
>> >>> Thanks in advance,
>> >>>
>> >>> Best,
>> >>>
>> >>>
>> >>> German
>> >>>
>> >>> 2015-11-23 15:06 GMT-03:00 Robert LeBlanc:
>> >>>
>> >>>
>> >>>     -----BEGIN PGP SIGNED MESSAGE-----
>> >>>     Hash: SHA256
>> >>>
>> >>>     Are you using datagram (unconnected) mode or connected mode? With
>> >>>     connected mode you can raise your MTU to 64K, which may help on the
>> >>>     network side.
>> >>>     - ----------------
>> >>>     Robert LeBlanc
>> >>>     PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>
>> >>>
>> >>>     On Mon, Nov 23, 2015 at 10:40 AM, German Anders  wrote:
>> >>>      > Hi Mark,
>> >>>      >
>> >>>      > Thanks a lot for the quick response. Regarding the numbers that
>> >>>     you sent me,
>> >>>      > they look REALLY nice. I have the following setup:
>> >>>      >
>> >>>      > 4 OSD nodes:
>> >>>      >
>> >>>      > 2 x Intel Xeon E5-2650v2 @2.60Ghz
>> >>>      > 1 x Network controller: Mellanox Technologies MT27500 Family
>> >>>     [ConnectX-3]
>> >>>      > Dual-Port (1 for PUB and 1 for CLUS)
>> >>>      > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
>> >>>      > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on
>> >>>     the same
>> >>>      > drive, so 1:1 relationship)
>> >>>      > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
>> >>>      > 128GB RAM
>> >>>      >
>> >>>      > [0:0:0:0]    disk    ATA      INTEL SSDSC2BA20 0110  /dev/sdc
>> >>>      > [0:0:1:0]    disk    ATA      INTEL SSDSC2BA20 0110  /dev/sdd
>> >>>      > [0:0:2:0]    disk    ATA      INTEL SSDSC2BA20 0110  /dev/sde
>> >>>      > [0:0:3:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdf
>> >>>      > [0:0:4:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdg
>> >>>      > [0:0:5:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdh
>> >>>      > [0:0:6:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdi
>> >>>      > [0:0:7:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdj
>> >>>      > [0:0:8:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdk
>> >>>      > [0:0:9:0]    disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdl
>> >>>      > [0:0:10:0]   disk    ATA      INTEL SSDSC2BB80 0130  /dev/sdm
>> >>>      >
>> >>>      > sdf                                8:80   0 745.2G  0 disk
>> >>>      > |-sdf1                             8:81   0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-16
>> >>>      > `-sdf2                             8:82   0     5G  0 part
>> >>>      > sdg                                8:96   0 745.2G  0 disk
>> >>>      > |-sdg1                             8:97   0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-17
>> >>>      > `-sdg2                             8:98   0     5G  0 part
>> >>>      > sdh                                8:112  0 745.2G  0 disk
>> >>>      > |-sdh1                             8:113  0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-18
>> >>>      > `-sdh2                             8:114  0     5G  0 part
>> >>>      > sdi                                8:128  0 745.2G  0 disk
>> >>>      > |-sdi1                             8:129  0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-19
>> >>>      > `-sdi2                             8:130  0     5G  0 part
>> >>>      > sdj                                8:144  0 745.2G  0 disk
>> >>>      > |-sdj1                             8:145  0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-20
>> >>>      > `-sdj2                             8:146  0     5G  0 part
>> >>>      > sdk                                8:160  0 745.2G  0 disk
>> >>>      > |-sdk1                             8:161  0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-21
>> >>>      > `-sdk2                             8:162  0     5G  0 part
>> >>>      > sdl                                8:176  0 745.2G  0 disk
>> >>>      > |-sdl1                             8:177  0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-22
>> >>>      > `-sdl2                             8:178  0     5G  0 part
>> >>>      > sdm                                8:192  0 745.2G  0 disk
>> >>>      > |-sdm1                             8:193  0 740.2G  0 part
>> >>>      > /var/lib/ceph/osd/ceph-23
>> >>>      > `-sdm2                             8:194  0     5G  0 part
>> >>>      >
>> >>>      >
>> >>>      > $ rados bench -p rbd 20 write --no-cleanup -t 4
>> >>>      >  Maintaining 4 concurrent writes of 4194304 bytes for up to 20
>> >>>     seconds or 0
>> >>>      > objects
>> >>>      >  Object prefix: benchmark_data_cibm01_1409
>> >>>      >    sec Cur ops   started  finished  avg MB/s  cur MB/s  last
>> >>> lat
>> >>>       avg lat
>> >>>      >      0       0         0         0         0         0
>> >>> -
>> >>>             0
>> >>>      >      1       4       121       117   467.894       468
>> >>> 0.0337203
>> >>>     0.0336809
>> >>>      >      2       4       244       240   479.895       492
>> >>> 0.0304306
>> >>>     0.0330524
>> >>>      >      3       4       372       368   490.559       512
>> >>> 0.0361914
>> >>>     0.0323822
>> >>>      >      4       4       491       487   486.899       476
>> >>> 0.0346544
>> >>>     0.0327169
>> >>>      >      5       4       587       583   466.302       384
>> >>> 0.110718
>> >>>     0.0342427
>> >>>      >      6       4       701       697   464.575       456
>> >>> 0.0324953
>> >>>     0.0343136
>> >>>      >      7       4       811       807   461.053       440
>> >>> 0.0400344
>> >>>     0.0345994
>> >>>      >      8       4       923       919   459.412       448
>> >>> 0.0255677
>> >>>     0.0345767
>> >>>      >      9       4      1032      1028   456.803       436
>> >>> 0.0309743
>> >>>     0.0349256
>> >>>      >     10       4      1119      1115   445.917       348
>> >>> 0.229508
>> >>>     0.0357856
>> >>>      >     11       4      1222      1218   442.826       412
>> >>> 0.0277902
>> >>>     0.0360635
>> >>>      >     12       4      1315      1311   436.919       372
>> >>> 0.0303377
>> >>>     0.0365673
>> >>>      >     13       4      1424      1420   436.842       436
>> >>> 0.0288001
>> >>>       0.03659
>> >>>      >     14       4      1524      1520   434.206       400
>> >>> 0.0360993
>> >>>     0.0367697
>> >>>      >     15       4      1632      1628   434.054       432
>> >>> 0.0296406
>> >>>     0.0366877
>> >>>      >     16       4      1740      1736   433.921       432
>> >>> 0.0310995
>> >>>     0.0367746
>> >>>      >     17       4      1836      1832    430.98       384
>> >>> 0.0250518
>> >>>     0.0370169
>> >>>      >     18       4      1941      1937   430.366       420
>> >>> 0.027502
>> >>>     0.0371341
>> >>>      >     19       4      2049      2045   430.448       432
>> >>> 0.0260257
>> >>>     0.0370807
>> >>>      > 2015-11-23 12:10:58.587087min lat: 0.0229266 max lat: 0.27063
>> >>> avg
>> >>>     lat:
>> >>>      > 0.0373936
>> >>>      >    sec Cur ops   started  finished  avg MB/s  cur MB/s  last
>> >>> lat
>> >>>       avg lat
>> >>>      >     20       4      2141      2137   427.322       368
>> >>> 0.0351276
>> >>>     0.0373936
>> >>>      >  Total time run:         20.186437
>> >>>      > Total writes made:      2141
>> >>>      > Write size:             4194304
>> >>>      > Bandwidth (MB/sec):     424.245
>> >>>      >
>> >>>      > Stddev Bandwidth:       102.136
>> >>>      > Max bandwidth (MB/sec): 512
>> >>>      > Min bandwidth (MB/sec): 0
>> >>>      > Average Latency:        0.0376536
>> >>>      > Stddev Latency:         0.032886
>> >>>      > Max latency:            0.27063
>> >>>      > Min latency:            0.0229266
>> >>>      >
>> >>>      >
>> >>>      > $ rados bench -p rbd 20 seq --no-cleanup -t 4
>> >>>      >    sec Cur ops   started  finished  avg MB/s  cur MB/s  last
>> >>> lat
>> >>>       avg lat
>> >>>      >      0       0         0         0         0         0
>> >>> -
>> >>>             0
>> >>>      >      1       4       394       390   1559.52      1560
>> >>> 0.0148888
>> >>>     0.0102236
>> >>>      >      2       4       753       749   1496.68      1436
>> >>> 0.0129162
>> >>>     0.0106595
>> >>>      >      3       4      1137      1133   1509.65      1536
>> >>> 0.0101854
>> >>>     0.0105731
>> >>>      >      4       4      1526      1522   1521.17      1556
>> >>> 0.0122154
>> >>>     0.0103827
>> >>>      >      5       4      1890      1886   1508.07      1456
>> >>> 0.00825445
>> >>>     0.0105908
>> >>>      >  Total time run:        5.675418
>> >>>      > Total reads made:     2141
>> >>>      > Read size:            4194304
>> >>>      > Bandwidth (MB/sec):    1508.964
>> >>>      >
>> >>>      > Average Latency:       0.0105951
>> >>>      > Max latency:           0.211469
>> >>>      > Min latency:           0.00603694
>> >>>      >
>> >>>      >
>> >>>      > I'm not even close to the numbers that you are getting... :( Any
>> >>>      > ideas or hints? Also, I've configured NOOP as the scheduler for all
>> >>>      > the SSD disks. I don't really know what else to look at in order to
>> >>>      > improve performance and get numbers similar to yours.
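>> >>>      >
>> >>>      > (For reference, a sketch of how that is usually checked and set per
>> >>>      > device via sysfs; the device names are the ones from the listing above:)
>> >>>      >
>> >>>      > cat /sys/block/sdf/queue/scheduler     # active scheduler shown in brackets
>> >>>      > for d in sdf sdg sdh sdi sdj sdk sdl sdm; do
>> >>>      >     echo noop > /sys/block/$d/queue/scheduler
>> >>>      > done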
>> >>>      >
>> >>>      >
>> >>>      > Thanks in advance,
>> >>>      >
>> >>>      > Cheers,
>> >>>      >
>> >>>      >
>> >>>      > German
>> >>>      >
>> >>>      > 2015-11-23 13:32 GMT-03:00 Mark Nelson :
>> >>>      >>
>> >>>      >> Hi German,
>> >>>      >>
>> >>>      >> I don't have exactly the same setup, but on the Ceph community
>> >>>      >> cluster I have run tests with:
>> >>>      >>
>> >>>      >> 4 nodes, each of which are configured in some tests with:
>> >>>      >>
>> >>>      >> 2 x Intel Xeon E5-2650
>> >>>      >> 1 x Intel XL710 40GbE (currently limited to about 2.5GB/s
>> >>> each)
>> >>>      >> 1 x Intel P3700 800GB (4 OSDs per card using 4 data and 4
>> >>> journal
>> >>>      >> partitions)
>> >>>      >> 64GB RAM
>> >>>      >>
>> >>>      >> With filestore, I can get an aggregate throughput of:
>> >>>      >>
>> >>>      >> 1MB randread: 8715.3MB/s
>> >>>      >> 4MB randread: 8046.2MB/s
>> >>>      >>
>> >>>      >> This is with 4 fio instances on the same nodes as the OSDs
>> >>> using
>> >>>     the fio
>> >>>      >> librbd engine.
>> >>>      >>
>> >>>      >> A couple of things I would suggest trying:
>> >>>      >>
>> >>>      >> 1) See how rados bench does.  This is an easy test and you can
>> >>>     see how
>> >>>      >> different the numbers look.
>> >>>      >>
>> >>>      >> 2) Try fio with librbd to see if it might be a qemu limitation
>> >>>      >> (a librbd sketch follows after this list).
>> >>>      >>
>> >>>      >> 3) Assuming you are using IPoIB, try some iperf tests to see
>> >>> how
>> >>>     your
>> >>>      >> network is doing.
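>> >>>      >>
>> >>>      >> A minimal librbd fio invocation along the lines of (2), assuming the
>> >>>      >> rbd engine is compiled into fio and that an image named 'test1' exists
>> >>>      >> in the 'rbd' pool (both names are placeholders):
>> >>>      >>
>> >>>      >> fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test1 \
>> >>>      >>     --rw=randread --bs=1m --iodepth=32 --numjobs=4 --runtime=22 \
>> >>>      >>     --time_based --direct=1 --group_reporting --name=librbd-randread-1m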
>> >>>      >>
>> >>>      >> Mark
>> >>>      >>
>> >>>      >>
>> >>>      >> On 11/23/2015 10:17 AM, German Anders wrote:
>> >>>      >>>
>> >>>      >>> Thanks a lot for the quick update, Greg. This leads me to ask
>> >>>      >>> if there's anything out there to improve performance in an
>> >>>      >>> InfiniBand environment with Ceph. In the cluster that I mentioned
>> >>>      >>> earlier, I've set up 4 OSD server nodes, each with 8 OSD daemons
>> >>>      >>> running on 800GB Intel SSD DC S3710 disks (740.2G for the OSD and
>> >>>      >>> 5G for the journal), and also using IB FDR 56Gb/s for the PUB and
>> >>>      >>> CLUS networks, and I'm getting the following fio numbers:
>> >>>      >>>
>> >>>      >>>
>> >>>      >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32
>> >>> --runtime=22
>> >>>      >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>> >>>     --direct=1
>> >>>      >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1
>> >>> --norandommap
>> >>>      >>> --group_reporting --exitall --name
>> >>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> >>>      >>> --filename=/mnt/rbd/test1
>> >>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):
>> >>>     rw=randread,
>> >>>      >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>>      >>> ...
>> >>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):
>> >>>     rw=randread,
>> >>>      >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>>      >>> fio-2.1.3
>> >>>      >>> Starting 4 processes
>> >>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out
>> >>> IO
>> >>>     file(s)
>> >>>      >>> (1 file(s) / 16384MB)
>> >>>      >>> Jobs: 4 (f=4): [rrrr] [33.8% done] [1082MB/0KB/0KB /s]
>> >>>     [1081/0/0 iops]
>> >>>      >>> [eta 00m:45s]
>> >>>      >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0,
>> >>>     jobs=4):
>> >>>      >>> err= 0: pid=63852: Mon Nov 23 10:48:07 2015
>> >>>      >>>    read : io=21899MB, bw=988.23MB/s, iops=988, runt=
>> >>> 22160msec
>> >>>      >>>      slat (usec): min=192, max=186274, avg=3990.48,
>> >>> stdev=7533.77
>> >>>      >>>      clat (usec): min=10, max=808610, avg=125099.41,
>> >>> stdev=90717.56
>> >>>      >>>       lat (msec): min=6, max=809, avg=129.09, stdev=91.14
>> >>>      >>>      clat percentiles (msec):
>> >>>      >>>       |  1.00th=[   27],  5.00th=[   38], 10.00th=[   45],
>> >>>     20.00th=[
>> >>>      >>> 61],
>> >>>      >>>       | 30.00th=[   74], 40.00th=[   85], 50.00th=[  100],
>> >>>     60.00th=[
>> >>>      >>> 117],
>> >>>      >>>       | 70.00th=[  141], 80.00th=[  174], 90.00th=[  235],
>> >>>     95.00th=[
>> >>>      >>> 297],
>> >>>      >>>       | 99.00th=[  482], 99.50th=[  578], 99.90th=[  717],
>> >>>     99.95th=[
>> >>>      >>> 750],
>> >>>      >>>       | 99.99th=[  775]
>> >>>      >>>      bw (KB  /s): min=134691, max=335872, per=25.08%,
>> >>>     avg=253748.08,
>> >>>      >>> stdev=40454.88
>> >>>      >>>      lat (usec) : 20=0.01%
>> >>>      >>>      lat (msec) : 10=0.02%, 20=0.27%, 50=12.90%, 100=36.93%,
>> >>>     250=41.39%
>> >>>      >>>      lat (msec) : 500=7.59%, 750=0.84%, 1000=0.05%
>> >>>      >>>    cpu          : usr=0.11%, sys=26.76%, ctx=39695, majf=0,
>> >>>     minf=405
>> >>>      >>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%,
>> >>>     32=99.4%,
>> >>>      >>>  >=64=0.0%
>> >>>      >>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>> >>>     64=0.0%,
>> >>>      >>>  >=64=0.0%
>> >>>      >>>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%,
>> >>>     64=0.0%,
>> >>>      >>>  >=64=0.0%
>> >>>      >>>       issued    : total=r=21899/w=0/d=0, short=r=0/w=0/d=0
>> >>>      >>>
>> >>>      >>> Run status group 0 (all jobs):
>> >>>      >>>     READ: io=21899MB, aggrb=988.23MB/s, minb=988.23MB/s,
>> >>>      >>> maxb=988.23MB/s, mint=22160msec, maxt=22160msec
>> >>>      >>>
>> >>>      >>> Disk stats (read/write):
>> >>>      >>>    rbd1: ios=43736/163, merge=0/5, ticks=3189484/15276,
>> >>>      >>> in_queue=3214988, util=99.78%
>> >>>      >>>
>> >>>      >>>
>> >>>      >>>
>> >>>      >>>
>> >>>
>> >>>
>> >>> ############################################################################################################################################################
>> >>>      >>>
>> >>>      >>>
>> >>>      >>> # fio --rw=randread --bs=4m --numjobs=4 --iodepth=32
>> >>> --runtime=22
>> >>>      >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>> >>>     --direct=1
>> >>>      >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1
>> >>> --norandommap
>> >>>      >>> --group_reporting --exitall --name
>> >>>      >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec
>> >>>      >>> --filename=/mnt/rbd/test2
>> >>>      >>>
>> >>>      >>> fio-2.1.3
>> >>>      >>> Starting 4 processes
>> >>>      >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: Laying out
>> >>> IO
>> >>>     file(s)
>> >>>      >>> (1 file(s) / 16384MB)
>> >>>      >>> Jobs: 4 (f=4): [rrrr] [28.7% done] [894.3MB/0KB/0KB /s]
>> >>>     [223/0/0 iops]
>> >>>      >>> [eta 00m:57s]
>> >>>      >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: (groupid=0,
>> >>>     jobs=4):
>> >>>      >>> err= 0: pid=64654: Mon Nov 23 10:51:58 2015
>> >>>      >>>    read : io=18952MB, bw=876868KB/s, iops=214, runt=
>> >>> 22132msec
>> >>>      >>>      slat (usec): min=518, max=81398, avg=18576.88,
>> >>> stdev=14840.55
>> >>>      >>>      clat (msec): min=90, max=1915, avg=570.37, stdev=166.51
>> >>>      >>>       lat (msec): min=123, max=1936, avg=588.95, stdev=169.19
>> >>>      >>>      clat percentiles (msec):
>> >>>      >>>       |  1.00th=[  258],  5.00th=[  343], 10.00th=[  383],
>> >>>     20.00th=[
>> >>>      >>> 437],
>> >>>      >>>       | 30.00th=[  482], 40.00th=[  519], 50.00th=[  553],
>> >>>     60.00th=[
>> >>>      >>> 594],
>> >>>      >>>       | 70.00th=[  627], 80.00th=[  685], 90.00th=[  775],
>> >>>     95.00th=[
>> >>>      >>> 865],
>> >>>      >>>       | 99.00th=[ 1057], 99.50th=[ 1156], 99.90th=[ 1680],
>> >>>     99.95th=[
>> >>>      >>> 1860],
>> >>>      >>>       | 99.99th=[ 1909]
>> >>>      >>>      bw (KB  /s): min= 5665, max=383251, per=24.61%,
>> >>> avg=215755.74,
>> >>>      >>> stdev=61735.70
>> >>>      >>>      lat (msec) : 100=0.02%, 250=0.80%, 500=33.88%,
>> >>> 750=53.31%,
>> >>>      >>> 1000=10.26%
>> >>>      >>>      lat (msec) : 2000=1.73%
>> >>>      >>>    cpu          : usr=0.07%, sys=12.52%, ctx=32466, majf=0,
>> >>>     minf=372
>> >>>      >>>    IO depths    : 1=0.1%, 2=0.2%, 4=0.3%, 8=0.7%, 16=1.4%,
>> >>>     32=97.4%,
>> >>>      >>>  >=64=0.0%
>> >>>      >>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>> >>>     64=0.0%,
>> >>>      >>>  >=64=0.0%
>> >>>      >>>       complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%,
>> >>>     64=0.0%,
>> >>>      >>>  >=64=0.0%
>> >>>      >>>       issued    : total=r=4738/w=0/d=0, short=r=0/w=0/d=0
>> >>>      >>>
>> >>>      >>> Run status group 0 (all jobs):
>> >>>      >>>     READ: io=18952MB, aggrb=876868KB/s, minb=876868KB/s,
>> >>>      >>> maxb=876868KB/s, mint=22132msec, maxt=22132msec
>> >>>      >>>
>> >>>      >>> Disk stats (read/write):
>> >>>      >>>    rbd1: ios=37721/177, merge=0/5, ticks=3075924/11408,
>> >>>      >>> in_queue=3097448, util=99.77%
>> >>>      >>>
>> >>>      >>>
>> >>>      >>> Can anyone share some results from a similar environment?
>> >>>      >>>
>> >>>      >>> Thanks in advance,
>> >>>      >>>
>> >>>      >>> Best,
>> >>>      >>>
>> >>>      >>>
>> >>>      >>> German
>> >>>      >>>
>> >>>      >>> 2015-11-23 13:08 GMT-03:00 Gregory Farnum:
>> >>>     >>>
>> >>>      >>>     On Mon, Nov 23, 2015 at 10:05 AM, German Anders wrote:
>> >>>      >>>     > Hi all,
>> >>>      >>>     >
>> >>>      >>>     > I want to know if there's any improvement or update
>> >>>      >>>     > regarding Ceph 0.94.5 with accelio. I have an already
>> >>>      >>>     > configured cluster (with no data on it) and I would like
>> >>>      >>>     > to know if there's a way to 'modify' the cluster in order
>> >>>      >>>     > to use accelio. Any info would be really appreciated.
>> >>>      >>>
>> >>>      >>>     The XioMessenger is still experimental. As far as I know,
>> >>>      >>>     it's not expected to be stable any time soon, and I can't
>> >>>      >>>     imagine it will be backported to Hammer even when done.
>> >>>      >>>     -Greg
>> >>>      >>>
>> >>>      >>>
>> >>>      >>>
>> >>>      >>>
>> >>>      >>> _______________________________________________
>> >>>      >>> ceph-users mailing list
>> >>>      >>> ceph-users@xxxxxxxxxxxxxx
>> >>>      >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>      >>>
>> >>>      >> _______________________________________________
>> >>>      >> ceph-users mailing list
>> >>>      >> ceph-users@xxxxxxxxxxxxxx
>> >>>      >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>      >
>> >>>      >
>> >>>      >
>> >>>      > _______________________________________________
>> >>>      > ceph-users mailing list
>> >>>      > ceph-users@xxxxxxxxxxxxxx
>> >>>      > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>      >
>> >>>
>> >>>     -----BEGIN PGP SIGNATURE-----
>> >>>     Version: Mailvelope v1.2.3
>> >>>     Comment: https://www.mailvelope.com
>> >>>
>> >>>     wsFcBAEBCAAQBQJWU1WqCRDmVDuy+mK58QAAo5cQALjuZB+dyjbcRDyScvj/
>> >>>     qjurMqCHlScgG9U8CE4L6/E/QUfCNmdvE4KaeQC82oj/SplXYOuglTHJkUMg
>> >>>     KPyjb9jJs+ZyS560IoUB/l/XQZpO9WL+DNnSAg96Hpb3eG+G5jukW9/E/QHQ
>> >>>     aDjn/c1njEqUhxMAosUFZR58CxejyyI5Vr/SXX+oE6y2tCF31Z3KPiOVTOtj
>> >>>     BPIx74xpigXMSP+zaK4UelhjPzrRnefkN2sLpQS5uwJlOY1f35KoM3dX+LHO
>> >>>     2BWpyrLUtL6ZzpalKr/QbaWko1VM109vjAoPZ3X82ig9DZp2DW8ZVX4abVcy
>> >>>     +Zyre4SCncKFJZcL9VkQHPJxRFhqXHC43mpSHIKmhuhmGVwr9ngiKGUY1Q7t
>> >>>     O0aks06KHfqSRxjWmuhtP0eMLwsH7gLAEqqtAjnIhRTCDDkhRdp/MdZJ7ftO
>> >>>     LHF9+Eqdp/KiVrGK7BX9zwVshr608bR4g7JCfK4/ukSHXOWFVR6GZ8jue85q
>> >>>     e6dWhHsdwrPt1QnSrfhnKjoMdhTpvPVzlxqo2jHDXEyE57RxW/zXr776HxcQ
>> >>>     cISj4zDZ0nGZ1F8w4DdB0ql8CpsCDAEoaNG0ZQPXcItyrHIB0lFOJYDi5m+4
>> >>>     YqOCG8TWh7b28IbEEwwUSpx3pi2iyH0ObJZM5dgf62AOCKCEsixf+UguFVwd
>> >>>     /jdL
>> >>>     =6LtO
>> >>>     -----END PGP SIGNATURE-----
>> >>>
>> >>>
>> >
>
>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWVJ/DCRDmVDuy+mK58QAAku0P/0u8bDO/bAK9YxPHE0iP
ZFOI+mz8QSZ7Omu/3DKlMBtj81Amv781j0jc6iIFBijYGUBeTlr8HPZgsBry
Y5MsDXxrQPiboWSQ7kF8cOfOlZd+JmSSnmum6Gw9P7XzZAXfLwuEIvZitIHv
ivHJPykj34A3ZdJByuXdQG929nIOLRKUKdXnIXOMrNQQWSwn4u/v4pGbYchN
EIhjAFLN+maKAE+KXgARkwDpXKMw+//Tu435GzDGzwXND6m9Vk1JKsJ42qvv
3D5In0xtNuKayUDLwv0WCsQrkysomY/H+PgMoa2Nu8Uo9jQQqH36S/KMyq6e
vjbhuwrJ1ZnWKtdhixCk5fC66D41kchOPmFqXwKBBczAWj2HfO8/naFRqX6G
8IePIkCZ80PVqRO4n8/IDp/JMMU6y6PfWhU+1VI/HHPaxqBvHM6RiaGLy+b7
F3wbhcM6WrFfLMn5jjhkBhL+A/s1Z+Zwzg19wVGFzVFuFj6Tzn+mgWvocIF3
GY1Ii3R5QR+z1IC6RJYHbf/jJgPiwnh5/sPy9WV/td2sQrrt2Bg4Bj8mNa9l
3adcWQnaZuATi9wOClGPI/R1mZrXLna3QnDTuxlibEYf/XpEi4iE96/tH6N1
92HwHpnWhO5roS0sug7YS3uqZ81EX5t+5SoX07Y6ZOln2i8f7TvhZbXaDB13
beNr
=ZIWh
-----END PGP SIGNATURE-----

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
