Hi Mark,
Thanks a lot for the quick response. Regarding the numbers you sent me, they look REALLY nice. I have the following setup:
4 OSD nodes:
2 x Intel Xeon E5-2650v2 @2.60Ghz
1 x Network controller: Mellanox Technologies MT27500 Family [ConnectX-3] Dual-Port (1 for PUB and 1 for CLUS)
1 x SAS2308 PCI-Express Fusion-MPT SAS-2
8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the same drive, so 1:1 relationship)
3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
128GB RAM
[0:0:0:0] disk ATA INTEL SSDSC2BA20 0110 /dev/sdc
[0:0:1:0] disk ATA INTEL SSDSC2BA20 0110 /dev/sdd
[0:0:2:0] disk ATA INTEL SSDSC2BA20 0110 /dev/sde
[0:0:3:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdf
[0:0:4:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdg
[0:0:5:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdh
[0:0:6:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdi
[0:0:7:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdj
[0:0:8:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdk
[0:0:9:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdl
[0:0:10:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdm
sdf 8:80 0 745.2G 0 disk
|-sdf1 8:81 0 740.2G 0 part /var/lib/ceph/osd/ceph-16
`-sdf2 8:82 0 5G 0 part
sdg 8:96 0 745.2G 0 disk
|-sdg1 8:97 0 740.2G 0 part /var/lib/ceph/osd/ceph-17
`-sdg2 8:98 0 5G 0 part
sdh 8:112 0 745.2G 0 disk
|-sdh1 8:113 0 740.2G 0 part /var/lib/ceph/osd/ceph-18
`-sdh2 8:114 0 5G 0 part
sdi 8:128 0 745.2G 0 disk
|-sdi1 8:129 0 740.2G 0 part /var/lib/ceph/osd/ceph-19
`-sdi2 8:130 0 5G 0 part
sdj 8:144 0 745.2G 0 disk
|-sdj1 8:145 0 740.2G 0 part /var/lib/ceph/osd/ceph-20
`-sdj2 8:146 0 5G 0 part
sdk 8:160 0 745.2G 0 disk
|-sdk1 8:161 0 740.2G 0 part /var/lib/ceph/osd/ceph-21
`-sdk2 8:162 0 5G 0 part
sdl 8:176 0 745.2G 0 disk
|-sdl1 8:177 0 740.2G 0 part /var/lib/ceph/osd/ceph-22
`-sdl2 8:178 0 5G 0 part
sdm 8:192 0 745.2G 0 disk
|-sdm1 8:193 0 740.2G 0 part /var/lib/ceph/osd/ceph-23
`-sdm2 8:194 0 5G 0 part
$ rados bench -p rbd 20 write --no-cleanup -t 4
Maintaining 4 concurrent writes of 4194304 bytes for up to 20 seconds or 0 objects
Object prefix: benchmark_data_cibm01_1409
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 4 121 117 467.894 468 0.0337203 0.0336809
2 4 244 240 479.895 492 0.0304306 0.0330524
3 4 372 368 490.559 512 0.0361914 0.0323822
4 4 491 487 486.899 476 0.0346544 0.0327169
5 4 587 583 466.302 384 0.110718 0.0342427
6 4 701 697 464.575 456 0.0324953 0.0343136
7 4 811 807 461.053 440 0.0400344 0.0345994
8 4 923 919 459.412 448 0.0255677 0.0345767
9 4 1032 1028 456.803 436 0.0309743 0.0349256
10 4 1119 1115 445.917 348 0.229508 0.0357856
11 4 1222 1218 442.826 412 0.0277902 0.0360635
12 4 1315 1311 436.919 372 0.0303377 0.0365673
13 4 1424 1420 436.842 436 0.0288001 0.03659
14 4 1524 1520 434.206 400 0.0360993 0.0367697
15 4 1632 1628 434.054 432 0.0296406 0.0366877
16 4 1740 1736 433.921 432 0.0310995 0.0367746
17 4 1836 1832 430.98 384 0.0250518 0.0370169
18 4 1941 1937 430.366 420 0.027502 0.0371341
19 4 2049 2045 430.448 432 0.0260257 0.0370807
2015-11-23 12:10:58.587087 min lat: 0.0229266 max lat: 0.27063 avg lat: 0.0373936
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 4 2141 2137 427.322 368 0.0351276 0.0373936
Total time run: 20.186437
Total writes made: 2141
Write size: 4194304
Bandwidth (MB/sec): 424.245
Stddev Bandwidth: 102.136
Max bandwidth (MB/sec): 512
Min bandwidth (MB/sec): 0
Average Latency: 0.0376536
Stddev Latency: 0.032886
Max latency: 0.27063
Min latency: 0.0229266
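One thing worth noting about the write numbers above (my own back-of-the-envelope arithmetic, not from Mark's mail): with -t 4 the benchmark keeps only 4 writes in flight, so throughput is capped by per-op latency, not by the disks or the network.

```shell
# With 4 concurrent 4 MB writes and ~0.035 s average latency per op,
# the client can sustain at most roughly 4 x 4 MB / 0.035 s:
awk 'BEGIN { printf "%.0f MB/s\n", 4 * 4 / 0.035 }'
# That is close to the ~424 MB/s measured, which suggests the test is
# latency-bound; rerunning with a higher -t should raise the number.
```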
$ rados bench -p rbd 20 seq --no-cleanup -t 4
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 4 394 390 1559.52 1560 0.0148888 0.0102236
2 4 753 749 1496.68 1436 0.0129162 0.0106595
3 4 1137 1133 1509.65 1536 0.0101854 0.0105731
4 4 1526 1522 1521.17 1556 0.0122154 0.0103827
 5 4 1890 1886 1508.07 1456 0.00825445 0.0105908
Total time run: 5.675418
Total reads made: 2141
Read size: 4194304
Bandwidth (MB/sec): 1508.964
Average Latency: 0.0105951
Max latency: 0.211469
Min latency: 0.00603694
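A side note on the seq run above (my own observation): it ends after ~5.7 s because seq can only read back the 2141 objects the write phase left behind, so the read figure comes from a very short sample.

```shell
# 2141 objects x 4 MB read back at ~1509 MB/s should take about:
awk 'BEGIN { printf "%.1f s\n", 2141 * 4 / 1509 }'
# which matches the 5.675 s total above. For a steadier read number,
# write more data (and with more concurrency) first, e.g. (a sketch,
# adjust pool and durations):
# rados bench -p rbd 60 write --no-cleanup -t 32
# rados bench -p rbd 20 seq -t 32
```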
I'm not even close to the numbers you are getting... :( Any ideas
or hints? Also, I've configured NOOP as the scheduler for all the SSD
disks. I don't really know what else to look at in order to improve
performance and get numbers similar to yours.
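For reference, this is a quick way to double-check which scheduler is actually in effect on each OSD SSD (a sketch; the device names come from my lsblk output above, adjust per host):

```shell
# Print the active I/O scheduler (shown in [brackets]) for each OSD SSD.
for d in sdf sdg sdh sdi sdj sdk sdl sdm; do
    printf '%s: ' "$d"
    cat "/sys/block/$d/queue/scheduler"
done
# To switch one disk to noop at runtime:
# echo noop > /sys/block/sdf/queue/scheduler
```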
Thanks in advance,
Cheers,
German
2015-11-23 13:32 GMT-03:00 Mark Nelson <mnelson@xxxxxxxxxx>:
Hi German,
I don't have exactly the same setup, but on the ceph community cluster I have tests with:
4 nodes, each of which are configured in some tests with:
2 x Intel Xeon E5-2650
1 x Intel XL710 40GbE (currently limited to about 2.5GB/s each)
1 x Intel P3700 800GB (4 OSDs per card using 4 data and 4 journal partitions)
64GB RAM
With filestore, I can get an aggregate throughput of:
1MB randread: 8715.3MB/s
4MB randread: 8046.2MB/s
This is with 4 fio instances on the same nodes as the OSDs using the fio librbd engine.
A couple of things I would suggest trying:
1) See how rados bench does. This is an easy test and you can see how different the numbers look.
2) try fio with librbd to see if it might be a qemu limitation.
3) Assuming you are using IPoIB, try some iperf tests to see how your network is doing.
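To make suggestions 2) and 3) concrete, the commands might look roughly like this (a sketch only; the pool name, image name, and addresses are placeholders, not taken from the mail):

```shell
# 2) fio against an RBD image via librbd, bypassing QEMU and the kernel
#    client (requires fio built with rbd support; 'testimg' is a
#    placeholder image name):
# fio --ioengine=rbd --pool=rbd --rbdname=testimg --direct=1 \
#     --rw=randread --bs=1m --iodepth=32 --numjobs=4 \
#     --runtime=22 --time_based --group_reporting --name=librbd-randread-1m

# 3) Raw network throughput over the IPoIB cluster network:
#    on the server host:  iperf -s
#    on the client host:  iperf -c <cluster-ip-of-server> -P 4 -t 30
```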
Mark
On 11/23/2015 10:17 AM, German Anders wrote:
Thanks a lot for the quick update Greg. This leads me to ask if there's
anything out there to improve performance in an InfiniBand environment
with Ceph. In the cluster that I mentioned earlier, I've set up 4 OSD
server nodes, each with 8 OSD daemons running on 800GB Intel SSD
DC S3510 disks (740.2G for OSD and 5G for journal), and also using IB FDR
56Gb/s for the PUB and CLUS network, and I'm getting the following fio
numbers:
# fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22 \
    --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1 \
    --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap \
    --group_reporting --exitall \
    --name dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec --filename=/mnt/rbd/test1
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
...
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
fio-2.1.3
Starting 4 processes
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO file(s)
(1 file(s) / 16384MB)
Jobs: 4 (f=4): [rrrr] [33.8% done] [1082MB/0KB/0KB /s] [1081/0/0 iops]
[eta 00m:45s]
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
err= 0: pid=63852: Mon Nov 23 10:48:07 2015
read : io=21899MB, bw=988.23MB/s, iops=988, runt= 22160msec
slat (usec): min=192, max=186274, avg=3990.48, stdev=7533.77
clat (usec): min=10, max=808610, avg=125099.41, stdev=90717.56
lat (msec): min=6, max=809, avg=129.09, stdev=91.14
clat percentiles (msec):
| 1.00th=[ 27], 5.00th=[ 38], 10.00th=[ 45], 20.00th=[ 61],
| 30.00th=[ 74], 40.00th=[ 85], 50.00th=[ 100], 60.00th=[ 117],
| 70.00th=[ 141], 80.00th=[ 174], 90.00th=[ 235], 95.00th=[ 297],
| 99.00th=[ 482], 99.50th=[ 578], 99.90th=[ 717], 99.95th=[ 750],
| 99.99th=[ 775]
bw (KB /s): min=134691, max=335872, per=25.08%, avg=253748.08,
stdev=40454.88
lat (usec) : 20=0.01%
lat (msec) : 10=0.02%, 20=0.27%, 50=12.90%, 100=36.93%, 250=41.39%
lat (msec) : 500=7.59%, 750=0.84%, 1000=0.05%
cpu : usr=0.11%, sys=26.76%, ctx=39695, majf=0, minf=405
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=21899/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=21899MB, aggrb=988.23MB/s, minb=988.23MB/s,
maxb=988.23MB/s, mint=22160msec, maxt=22160msec
Disk stats (read/write):
rbd1: ios=43736/163, merge=0/5, ticks=3189484/15276,
in_queue=3214988, util=99.78%
############################################################################################################################################################
# fio --rw=randread --bs=4m --numjobs=4 --iodepth=32 --runtime=22 \
    --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1 \
    --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap \
    --group_reporting --exitall \
    --name dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec --filename=/mnt/rbd/test2
fio-2.1.3
Starting 4 processes
dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: Laying out IO file(s)
(1 file(s) / 16384MB)
Jobs: 4 (f=4): [rrrr] [28.7% done] [894.3MB/0KB/0KB /s] [223/0/0 iops]
[eta 00m:57s]
dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
err= 0: pid=64654: Mon Nov 23 10:51:58 2015
read : io=18952MB, bw=876868KB/s, iops=214, runt= 22132msec
slat (usec): min=518, max=81398, avg=18576.88, stdev=14840.55
clat (msec): min=90, max=1915, avg=570.37, stdev=166.51
lat (msec): min=123, max=1936, avg=588.95, stdev=169.19
clat percentiles (msec):
| 1.00th=[ 258], 5.00th=[ 343], 10.00th=[ 383], 20.00th=[ 437],
| 30.00th=[ 482], 40.00th=[ 519], 50.00th=[ 553], 60.00th=[ 594],
| 70.00th=[ 627], 80.00th=[ 685], 90.00th=[ 775], 95.00th=[ 865],
| 99.00th=[ 1057], 99.50th=[ 1156], 99.90th=[ 1680], 99.95th=[ 1860],
| 99.99th=[ 1909]
bw (KB /s): min= 5665, max=383251, per=24.61%, avg=215755.74,
stdev=61735.70
lat (msec) : 100=0.02%, 250=0.80%, 500=33.88%, 750=53.31%, 1000=10.26%
lat (msec) : 2000=1.73%
cpu : usr=0.07%, sys=12.52%, ctx=32466, majf=0, minf=372
IO depths : 1=0.1%, 2=0.2%, 4=0.3%, 8=0.7%, 16=1.4%, 32=97.4%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=4738/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=18952MB, aggrb=876868KB/s, minb=876868KB/s,
maxb=876868KB/s, mint=22132msec, maxt=22132msec
Disk stats (read/write):
rbd1: ios=37721/177, merge=0/5, ticks=3075924/11408,
in_queue=3097448, util=99.77%
Can anyone share some results from a similar environment?
Thanks in advance,
Best,
*German*
2015-11-23 13:08 GMT-03:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
On Mon, Nov 23, 2015 at 10:05 AM, German Anders <ganders@xxxxxxxxxxxx> wrote:
> Hi all,
>
> I want to know if there's any improvement or update regarding ceph 0.94.5
> with accelio, I've an already configured cluster (with no data on it) and I
> would like to know if there's a way to 'modify' the cluster in order to use
> accelio. Any info would be really appreciated.
The XioMessenger is still experimental. As far as I know it's not
expected to be stable any time soon and I can't imagine it will be
backported to Hammer even when done.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com