Hello Mark,
thanks for that information. I have found the problem that was causing
the confusion with rados bench.
If you ever run
rados -p <pool> bench <time> write --no-cleanup
without a -b parameter, it creates 4M objects. So far so good.
If you subsequently read from that pool with a -b parameter, it will
still read the entire objects, no matter what you set -b to, but it
WILL display bandwidth as if each read were -b bytes in size. If -b is
smaller than 4M, the bandwidth displayed during the test run is
therefore much lower than it should be.
In the result summary the bandwidth is correct, but not during the
running test, where one line is printed per second. That line is a
direct ops * -b size -> bandwidth conversion, and since -b is not the
amount actually read, the displayed value is wrong (in the case of 4k
vs 4M, by three orders of magnitude).
Example (called back to back, the cluster isn't changing speed):
root@control1:~# rados -p rados bench 10 rand -t 16
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 506 490 1959.59 1960 0.018479 0.0313537
2 16 987 971 1941.66 1924 0.015547 0.0326188
^C
root@control1:~# rados -p rados bench 10 rand -t 16 -b 4096
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 472 456 1.78072 1.78125 0.010168 0.0342658
2 16 975 959 1.87266 1.96484 0.023054 0.0323466
^C
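The per-second figures above can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic, under the assumption
(consistent with the output, not taken from the rados source) that the
"cur MB/s" column is ops finished in that second times the -b value,
divided by 2**20. The function name mb_per_sec is made up for this
illustration.

```python
MIB = 1024 * 1024  # assumption: "MB/s" in the output means 2**20 bytes per second

def mb_per_sec(ops_finished, bytes_per_op, seconds=1.0):
    """Bandwidth in MB/s for a number of finished ops of a given size."""
    return ops_finished * bytes_per_op / MIB / seconds

# Second 1 of the "-b 4096" run above: 456 reads finished.
displayed = mb_per_sec(456, 4096)     # computed from the -b value
actual = mb_per_sec(456, 4 * MIB)     # computed from the real 4M object size

print(f"displayed: {displayed} MB/s")      # displayed: 1.78125 MB/s
print(f"actual:    {actual} MB/s")         # actual:    1824.0 MB/s
print(f"factor:    {actual / displayed}")  # factor:    1024.0
```

The displayed value matches the 1.78125 in the "cur MB/s" column, and
the same formula with 4M gives 1960 for the 490 ops in second 1 of the
run without -b.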
Perhaps rados should print a big warning whenever the -b parameter does
not match the object size during a read / random bench run, or outright
refuse to run?
Kind regards
René
On 20.11.2014 16:55, Mark Nelson wrote:
Hi Rene,
The easiest way to check is to create a fresh pool and look at the files
that are created under an OSD for a PG associated with that pool. Here's
an example using firefly:
perf@magna003:/$ ceph-osd --version
ceph version 0.80.7-129-gc069bce (c069bce4e8180da3c0ca4951365032a45df76468)
perf@magna003:/$ ceph osd pool create foo 1024 1024
perf@magna003:/$ ceph osd lspools
0 data,1 metadata,2 rbd,3 cbt-kernelrbdfio,4 foo,
perf@magna003:/$ ceph pg dump | grep "^4\." | tail -n 1
dumped all in format plain
4.4 0 0 0 0 0 0 0 active+clean 2014-11-20 03:46:53.407228 0'0 41:9 [1,7,3] 1 [1,7,3] 1 0'0 2014-11-20 03:46:32.986234 0'0 2014-11-20 03:46:32.986234
perf@magna003:/$ rados -p foo bench 30 write -b 8192 -t 16
...
perf@magna004:/$ ls -al /tmp/cbt/mnt/osd-device-1-data/current/4.4_head
total 236
drwxr-xr-x 2 root root 627 Nov 20 03:54 .
drwxr-xr-x 2213 root root 65536 Nov 20 03:46 ..
-rw-r--r-- 1 root root 8192 Nov 20 03:53 benchmark\udata\umagna003\u23623\uobject1386__head_69872404__4
-rw-r--r-- 1 root root 8192 Nov 20 03:53 benchmark\udata\umagna003\u23623\uobject1630__head_93B04C04__4
-rw-r--r-- 1 root root 8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject3533__head_EF16A404__4
-rw-r--r-- 1 root root 8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject4455__head_D77FB404__4
-rw-r--r-- 1 root root 8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject6346__head_2CD39004__4
-rw-r--r-- 1 root root 8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject6366__head_1C035804__4
-rw-r--r-- 1 root root 8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject7345__head_D4F0F804__4
-rw-r--r-- 1 root root 8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject8425__head_C961F404__4
-rw-r--r-- 1 root root 8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject9199__head_9C2B4404__4
Mark
On 11/20/2014 09:15 AM, René Gallati wrote:
Hello Mark,
sorry for barging in, but are you sure this is correct? In my tests
the -b parameter in rados bench does exactly one thing: its value is
used to calculate the displayed IO bandwidth, by taking the ops value
and multiplying it by the -b value. However, it *always* performs 4M
blocksize operations on the cluster. At least mine does:
rados -v
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
When I use -b 4096 it tells me I have 1.xx MBytes/sec bandwidth, but
looking at the network interfaces it pulls in 16 Gbit/sec (on 2x10Gb
LACP), which is in the same ballpark as when I leave the parameter out
and it uses 4M blocksize. Also, the IOPS values it displays are the
same in both cases.
So for me, -b does nothing useful at all in rados bench except falsify
the displayed bandwidth value. I consider -b in rados bench broken
right now.
Example:
1) rados -p rbd_bench bench 10 rand -t 32 --run-name three
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 32 552 520 2079.62 2080 0.074611 0.0576271
2 31 961 930 1859.66 1640 0.035957 0.0664811
3 32 1451 1419 1891.69 1956 0.029887 0.0664829
4 31 1984 1953 1952.68 2136 0.070463 0.0643797
5 31 2505 2474 1978.9 2084 0.253059 0.0639347
6 31 2971 2940 1959.71 1864 0.032281 0.0634925
7 31 3463 3432 1960.86 1968 0.03457 0.0647746
8 32 3966 3934 1966.72 2008 0.037449 0.0643011
9 32 4362 4330 1924.18 1584 0.043155 0.0661253
10 32 4882 4850 1939.73 2080 0.034571 0.0655373
Total time run: 10.105738
Total reads made: 4882
Read size: 4194304
Bandwidth (MB/sec): 1932.368
Average Latency: 0.0661377
Max latency: 0.890115
Min latency: 0.022432
network bandwidth on bond via bwm-ng:
14.90 Gb/s RX
2) rados -p rbd_bench bench 10 rand -t 32 -b 4096 --run-name three
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 32 523 491 1.91749 1.91797 0.03142 0.0594356
2 31 1043 1012 1.97591 2.03516 0.05142 0.061328
3 31 1503 1472 1.91617 1.79688 0.165884 0.064101
4 32 1998 1966 1.9195 1.92969 0.037625 0.0637315
5 32 2510 2478 1.93555 2 0.034491 0.0639315
6 31 2948 2917 1.89873 1.71484 0.033597 0.0652374
7 32 3446 3414 1.90479 1.94141 0.038067 0.0649422
8 32 3915 3883 1.89566 1.83203 0.201129 0.0653705
9 31 4419 4388 1.90419 1.97266 0.079625 0.065216
10 32 4918 4886 1.90827 1.94531 0.037762 0.0651324
Total time run: 10.167264
Total reads made: 4918
Read size: 4194304
Bandwidth (MB/sec): 1934.837
Average Latency: 0.0658465
Max latency: 0.788744
Min latency: 0.018118
network bandwidth on bond via bwm-ng:
15.53 Gb/s RX
So you see, the output varies tremendously; network and cluster activity
(and the ops column) do not. -b does not work.
The examples are from a non-production cluster of 7 servers with SSD
OSDs exclusively, everything on a 2x10GBit LACP bond network.
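As a cross-check, the summary figures of run 2 can be recomputed from
"Total reads made", "Read size" and "Total time run". The sketch below
assumes MB means 2**20 bytes and Gbit means 10**9 bits; the helper
names are invented for this illustration and are not part of rados.

```python
MIB = 1024 * 1024  # assumption: rados bench "MB" is 2**20 bytes

def summary_mb_per_sec(total_reads, read_size_bytes, total_time_sec):
    """Bandwidth as the summary block appears to compute it."""
    return total_reads * read_size_bytes / MIB / total_time_sec

def to_gbit_per_sec(mb_per_sec):
    """Convert MB/s (2**20 bytes) to Gbit/s (10**9 bits)."""
    return mb_per_sec * MIB * 8 / 1e9

# Run 2 (-b 4096): the summary still reports a read size of 4194304.
summary = summary_mb_per_sec(4918, 4194304, 10.167264)
print(round(summary, 3))                    # close to the reported 1934.837
print(round(to_gbit_per_sec(summary), 2))   # roughly 16 Gbit/s

# The per-second "avg MB/s" of about 1.9 would be a tiny fraction of that.
print(round(to_gbit_per_sec(1.90827), 3))
```

The recomputed summary is in the same range as the bwm-ng reading on
the bond, while the per-second figure derived from -b 4096 is not.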
Kind regards
René
On 20.11.2014 15:15, Mark Nelson wrote:
Hi Jay,
The -b parameter to rados bench controls the size of the objects being
written. Previously you were writing out 8KB objects, which behind the
scenes translates into writing out lots of small files on the OSDs.
Your dd tests were doing 1MB writes, which are much larger and more
sequential in nature (i.e. less moving the heads of the hard drives
around on the platters, so more time can be spent writing data out).
The new rados bench test I had you run wrote out 4MB objects, which
are larger yet and roughly the sweet spot as far as Ceph is concerned
for sequential reads/writes.
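To put rough numbers on this, here is a small sketch that converts the
write bandwidths Jay reported in this thread into objects written per
second. It assumes MB means 2**20 bytes, and the two runs may not have
used the same replication setting, so treat it as an illustration only.
The function name ops_per_sec is invented for this example.

```python
MIB = 1024 * 1024  # assumption: "MB/s" in rados bench output is 2**20 bytes/s

def ops_per_sec(mb_per_sec, object_size_bytes):
    """Objects written per second for a given bandwidth and object size."""
    return mb_per_sec * MIB / object_size_bytes

small = ops_per_sec(13.5, 8192)        # 8KB objects at 13.5 MB/s
large = ops_per_sec(583.652, 4194304)  # 4MB objects at 583.652 MB/s

print(small)            # 1728.0
print(round(large, 1))  # about 146
```

The 8KB run completes far more operations per second while moving far
less data, which fits the picture of many small files on the OSDs.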
Hope this helps!
Mark
On 11/20/2014 07:31 AM, Jay Janardhan wrote:
Hi Mark,
The results are below. These numbers look good but I'm not really sure
what to conclude now.
# rados -p performance_test bench 120 write -b 4194304 -t 100 --no-cleanup
Total time run: 120.133251
Total writes made: 17529
Write size: 4194304
Bandwidth (MB/sec): 583.652
Stddev Bandwidth: 269.76
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 0
Average Latency: 0.68418
Stddev Latency: 0.552344
Max latency: 5.06959
Min latency: 0.121746
# rados -p performance_test bench 120 seq -b 4194304 -t 100
Total time run: 58.451831
Total reads made: 17529
Read size: 4194304
Bandwidth (MB/sec): 1199.552
Average Latency: 0.332538
Max latency: 3.72943
Min latency: 0.007074
On Wed, Nov 19, 2014 at 8:55 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
On 11/19/2014 06:51 PM, Jay Janardhan wrote:
Can someone help me figure out what I can tune to improve the
performance? The cluster is pushing data at about 13 MB/s with a single
copy of data, while the underlying disks can push 100+ MB/s.
Can anyone help me with this?
*rados bench results:*
Concurrency  Replication size  Write (MB/s)  Seq Read (MB/s)
32           1                 13.5          32.8
32           2                 12.7          32.0
32           3                 6.1           30.2
*Commands I used (Pool size was updated appropriately):*
rados -p performance_test bench 120 write -b 8192 -t 100 --no-cleanup
rados -p performance_test bench 120 seq -t 100
How's performance if you do:
rados -p performance_test bench 120 write -b 4194304 -t 100 --no-cleanup
and
rados -p performance_test bench 120 seq -b 4194304 -t 100
instead?
Mark
1) *Disk tests - All have similar numbers:*
# dd if=/dev/zero of=here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 10.0691 s, 107 MB/s
2) *10G network is not holding up*
# iperf -c 10.13.10.15 -i2 -t 10
------------------------------------------------------------
Client connecting to 10.13.10.15, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.13.30.13 port 56459 connected with 10.13.10.15 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 2.0 sec 2.17 GBytes 9.33 Gbits/sec
[ 3] 2.0- 4.0 sec 2.18 GBytes 9.37 Gbits/sec
[ 3] 4.0- 6.0 sec 2.18 GBytes 9.37 Gbits/sec
[ 3] 6.0- 8.0 sec 2.18 GBytes 9.38 Gbits/sec
[ 3] 8.0-10.0 sec 2.18 GBytes 9.37 Gbits/sec
[ 3] 0.0-10.0 sec 10.9 GBytes 9.36 Gbits/sec
*3) Ceph Status*
# ceph health
HEALTH_OK
root@us1-r04u05s01-ceph:~# ceph status
cluster 5e95b6fa-0b99-4c31-8aa9-7a88h6hc5eda
health HEALTH_OK
monmap e4: 4 mons at {us1-r01u05s01-ceph=10.1.30.10:6789/0,us1-r01u09s01-ceph=10.1.30.11:6789/0,us1-r04u05s01-ceph=10.1.30.14:6789/0,us1-r04u09s01-ceph=10.1.30.15:6789/0},
election epoch 78, quorum 0,1,2,3 us1-r01u05s01-ceph,us1-r01u09s01-ceph,us1-r04u05s01-ceph,us1-r04u09s01-ceph
osdmap e1029: 97 osds: 97 up, 97 in
pgmap v1850869: 12480 pgs, 6 pools, 587 GB data, 116 kobjects
1787 GB used, 318 TB / 320 TB avail
12480 active+clean
client io 0 B/s rd, 25460 B/s wr, 20 op/s
*4) Ceph configuration*
# cat ceph.conf
[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
cephx require signatures = True
cephx cluster require signatures = True
cephx service require signatures = False
fsid = 5e95b6fa-0b99-4c31-8aa9-7a88h6hc5eda
osd pool default pg num = 4096
osd pool default pgp num = 4096
osd pool default size = 3
osd pool default min size = 1
osd pool default crush rule = 0
# Disable in-memory logs
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon osd down out interval = 600
mon osd min down reporters = 2
[mon.us1-r01u05s01-ceph]
host = us1-r01u05s01-ceph
mon addr = 10.1.30.10
[mon.us1-r01u09s01-ceph]
host = us1-r01u09s01-ceph
mon addr = 10.1.30.11
[mon.us1-r04u05s01-ceph]
host = us1-r04u05s01-ceph
mon addr = 10.1.30.14
[mon.us1-r04u09s01-ceph]
host = us1-r04u09s01-ceph
mon addr = 10.1.30.15
[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = noatime
osd journal size = 10000
cluster_network = 10.2.0.0/16
public_network = 10.1.0.0/16
osd mon heartbeat interval = 30
# Performance tuning
filestore merge threshold = 40
filestore split multiple = 8
osd op threads = 8
filestore op threads = 8
filestore max sync interval = 5
osd max scrubs = 1
# Recovery tuning
osd recovery max active = 5
osd max backfills = 2
osd recovery op priority = 2
osd recovery max chunk = 8388608
osd recovery threads = 1
osd objectstore = filestore
osd crush update on start = true
[mds]
*5) Ceph OSDs/Crushmap*
# ceph osd tree
# id  weight  type name  up/down  reweight
-14   5.46    root fusion_drives
-11   5.46      rack rack01-fusion
-7    2.73        host us1-r01u25s01-compf-fusion
89    2.73          osd.89  up  1
-9    2.73        host us1-r01u23s01-compf-fusion
96    2.73          osd.96  up  1
-13   315.2   root sata_drives
-10   166       rack rack01-sata
-2    76.44       host us1-r01u05s01-ceph
0     3.64          osd.0  up  1
1     3.64          osd.1  up  1
2     3.64          osd.2  up  1
3     3.64          osd.3  up  1
4     3.64          osd.4  up  1
5     3.64          osd.5  up  1
6     3.64          osd.6  up  1
7     3.64          osd.7  up  1
8     3.64          osd.8  up  1
9     3.64          osd.9  up  1
10    3.64          osd.10  up  1
11    3.64          osd.11  up  1
12    3.64          osd.12  up  1
13    3.64          osd.13  up  1
14    3.64          osd.14  up  1
15    3.64          osd.15  up  1
16    3.64          osd.16  up  1
17    3.64          osd.17  up  1
18    3.64          osd.18  up  1
19    3.64          osd.19  up  1
20    3.64          osd.20  up  1
-3    76.44       host us1-r01u09s01-ceph
21    3.64          osd.21  up  1
22    3.64          osd.22  up  1
23    3.64          osd.23  up  1
24    3.64          osd.24  up  1
25    3.64          osd.25  up  1
26    3.64          osd.26  up  1
27    3.64          osd.27  up  1
28    3.64          osd.28  up  1
29    3.64          osd.29  up  1
30    3.64          osd.30  up  1
31    3.64          osd.31  up  1
32    3.64          osd.32  up  1
33    3.64          osd.33  up  1
34    3.64          osd.34  up  1
35    3.64          osd.35  up  1
36    3.64          osd.36  up  1
37    3.64          osd.37  up  1
38    3.64          osd.38  up  1
39    3.64          osd.39  up  1
40    3.64          osd.40  up  1
41    3.64          osd.41  up  1
-6    6.54        host us1-r01u25s01-compf-sata
83    1.09          osd.83  up  1
84    1.09          osd.84  up  1
85    1.09          osd.85  up  1
86    1.09          osd.86  up  1
87    1.09          osd.87  up  1
88    1.09          osd.88  up  1
-8    6.54        host us1-r01u23s01-compf-sata
90    1.09          osd.90  up  1
91    1.09          osd.91  up  1
92    1.09          osd.92  up  1
93    1.09          osd.93  up  1
94    1.09          osd.94  up  1
95    1.09          osd.95  up  1
-12   149.2     rack rack04-sata
-4    72.8        host us1-r04u05s01-ceph
42    3.64          osd.42  up  1
43    3.64          osd.43  up  1
44    3.64          osd.44  up  1
45    3.64          osd.45  up  1
46    3.64          osd.46  up  1
47    3.64          osd.47  up  1
48    3.64          osd.48  up  1
49    3.64          osd.49  up  1
50    3.64          osd.50  up  1
51    3.64          osd.51  up  1
52    3.64          osd.52  up  1
53    3.64          osd.53  up  1
54    3.64          osd.54  up  1
55    3.64          osd.55  up  1
56    3.64          osd.56  up  1
57    3.64          osd.57  up  1
58    3.64          osd.58  up  1
59    3.64          osd.59  up  1
60    3.64          osd.60  up  1
61    3.64          osd.61  up  1
-5    76.44       host us1-r04u09s01-ceph
62    3.64          osd.62  up  1
63    3.64          osd.63  up  1
64    3.64          osd.64  up  1
65    3.64          osd.65  up  1
66    3.64          osd.66  up  1
67    3.64          osd.67  up  1
68    3.64          osd.68  up  1
69    3.64          osd.69  up  1
70    3.64          osd.70  up  1
71    3.64          osd.71  up  1
72    3.64          osd.72  up  1
73    3.64          osd.73  up  1
74    3.64          osd.74  up  1
75    3.64          osd.75  up  1
76    3.64          osd.76  up  1
77    3.64          osd.77  up  1
78    3.64          osd.78  up  1
79    3.64          osd.79  up  1
80    3.64          osd.80  up  1
81    3.64          osd.81  up  1
82    3.64          osd.82  up  1
*6) OSDs from one of the cluster nodes (rest are similar)*
/dev/sda1 3905109820 16741944 3888367876 1% /var/lib/ceph/osd/ceph-42
/dev/sdb1 3905109820 19553976 3885555844 1% /var/lib/ceph/osd/ceph-43
/dev/sdc1 3905109820 18081680 3887028140 1% /var/lib/ceph/osd/ceph-44
/dev/sdd1 3905109820 19070596 3886039224 1% /var/lib/ceph/osd/ceph-45
/dev/sde1 3905109820 17949284 3887160536 1% /var/lib/ceph/osd/ceph-46
/dev/sdf1 3905109820 18538344 3886571476 1% /var/lib/ceph/osd/ceph-47
/dev/sdg1 3905109820 17792608 3887317212 1% /var/lib/ceph/osd/ceph-48
/dev/sdh1 3905109820 20910976 3884198844 1% /var/lib/ceph/osd/ceph-49
/dev/sdi1 3905109820 19683208 3885426612 1% /var/lib/ceph/osd/ceph-50
/dev/sdj1 3905109820 20115236 3884994584 1% /var/lib/ceph/osd/ceph-51
/dev/sdk1 3905109820 19152812 3885957008 1% /var/lib/ceph/osd/ceph-52
/dev/sdm1 3905109820 18701728 3886408092 1% /var/lib/ceph/osd/ceph-53
/dev/sdn1 3905109820 19603536 3885506284 1% /var/lib/ceph/osd/ceph-54
/dev/sdo1 3905109820 20164928 3884944892 1% /var/lib/ceph/osd/ceph-55
/dev/sdp1 3905109820 19093024 3886016796 1% /var/lib/ceph/osd/ceph-56
/dev/sdq1 3905109820 18699344 3886410476 1% /var/lib/ceph/osd/ceph-57
/dev/sdr1 3905109820 19267068 3885842752 1% /var/lib/ceph/osd/ceph-58
/dev/sds1 3905109820 19745212 3885364608 1% /var/lib/ceph/osd/ceph-59
/dev/sdt1 3905109820 16321696 3888788124 1% /var/lib/ceph/osd/ceph-60
/dev/sdu1 3905109820 19154884 3885954936 1% /var/lib/ceph/osd/ceph-61
*7) Journal Files (there are TWO SSDs)*
# parted /dev/sdy print
Model: ATA SanDisk SD7UB2Q5 (scsi)
Disk /dev/sdy: 512GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number Start End Size File system Name Flags
1 1049kB 10.5GB 10.5GB ceph journal
2 10.5GB 21.0GB 10.5GB ceph journal
3 21.0GB 31.5GB 10.5GB ceph journal
4 31.5GB 41.9GB 10.5GB ceph journal
5 41.9GB 52.4GB 10.5GB ceph journal
6 52.4GB 62.9GB 10.5GB ceph journal
7 62.9GB 73.4GB 10.5GB ceph journal
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com