Re: Ceph performance - 10 times slower

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Thu, 20 Nov 2014 11:02:25 -0600

Ah, interesting! Perhaps submit a bug report in the tracker.

Thanks!
Mark

On 11/20/2014 10:20 AM, René Gallati wrote:
Hello Mark,

Ah, thanks for that information. I have found the problem that was
causing the confusion with rados bench.

If you ever do a

rados -p <pool> bench <time> write --no-cleanup

without any -b parameter, it creates 4M objects. So far so good.
If you subsequently read from that pool with a -b parameter, it will
still read the entire objects, no matter what you set -b to, but it
WILL display bandwidth as if each read were only -b bytes large. If the
size given in -b is lower than 4M, the bandwidth displayed during the
test run is therefore much lower than it should be.

In the result summary the bandwidth will be correct, but not during the
running test, where one line is printed per second. That line is a
direct ops * -b size -> bandwidth conversion, and since -b is not the
amount actually read, the displayed value will be wrong (in the case of
4k vs 4M that's three orders of magnitude).

Example (called back to back, the cluster isn't changing speed):

root@control1:~# rados -p rados bench 10 rand -t 16
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
      0       0         0         0         0         0         -         0
      1      16       506       490   1959.59      1960  0.018479 0.0313537
      2      16       987       971   1941.66      1924  0.015547 0.0326188
^C
root@control1:~# rados -p rados bench 10 rand -t 16 -b 4096
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
      0       0         0         0         0         0         -         0
      1      16       472       456   1.78072   1.78125  0.010168 0.0342658
      2      16       975       959   1.87266   1.96484  0.023054 0.0323466
^C
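
The per-second figures in the two runs above are consistent with that explanation. A minimal Python sketch, assuming rados bench counts one MB as 2**20 bytes and that the displayed value is simply finished ops times the -b size (the function names are illustrative and not part of rados):

```python
MIB = 1024 * 1024  # assumption: "MB" in the bench output means 2**20 bytes


def displayed_mb_per_s(ops_in_interval, b_bytes):
    """Per-second figure as it appears to be derived: ops * -b size."""
    return ops_in_interval * b_bytes / MIB


def actual_mb_per_s(ops_in_interval, object_bytes):
    """Per-second figure if the real object size were used instead."""
    return ops_in_interval * object_bytes / MIB


# Second 1 of the "-b 4096" run: 456 reads finished against 4M objects.
shown = displayed_mb_per_s(456, 4096)
real = actual_mb_per_s(456, 4 * MIB)

print(shown)         # 1.78125, the "cur MB/s" value printed for that second
print(real)          # 1824.0
print(real / shown)  # 1024.0, i.e. roughly three orders of magnitude
```

The same formula applied to second 2 (959 - 456 = 503 reads) gives about 1.96484, which again matches the printed line.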

Perhaps rados should print a big warning whenever the -b parameter does
not match the object size during a read / random bench run, or outright
refuse to run?

Kind regards

René

On 20.11.2014 16:55, Mark Nelson wrote:
Hi Rene,

The easiest way to check is to create a fresh pool and look at the files
that are created under an OSD for a PG associated with that pool. Here's
an example using firefly:

perf@magna003:/$ ceph-osd --version
ceph version 0.80.7-129-gc069bce (c069bce4e8180da3c0ca4951365032a45df76468)

perf@magna003:/$ ceph osd pool create foo 1024 1024

perf@magna003:/$ ceph osd lspools
0 data,1 metadata,2 rbd,3 cbt-kernelrbdfio,4 foo,

perf@magna003:/$ ceph pg dump | grep "^4\." | tail -n 1
dumped all in format plain
4.4    0    0    0    0    0    0    0    active+clean    2014-11-20 03:46:53.407228    0'0    41:9 [1,7,3]    1    [1,7,3]    1    0'0    2014-11-20 03:46:32.986234    0'0    2014-11-20 03:46:32.986234

perf@magna003:/$ rados -p foo bench 30 write -b 8192 -t 16
...

perf@magna004:/$ ls -al /tmp/cbt/mnt/osd-device-1-data/current/4.4_head
total 236
drwxr-xr-x    2 root root   627 Nov 20 03:54 .
drwxr-xr-x 2213 root root 65536 Nov 20 03:46 ..
-rw-r--r--    1 root root  8192 Nov 20 03:53 benchmark\udata\umagna003\u23623\uobject1386__head_69872404__4
-rw-r--r--    1 root root  8192 Nov 20 03:53 benchmark\udata\umagna003\u23623\uobject1630__head_93B04C04__4
-rw-r--r--    1 root root  8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject3533__head_EF16A404__4
-rw-r--r--    1 root root  8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject4455__head_D77FB404__4
-rw-r--r--    1 root root  8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject6346__head_2CD39004__4
-rw-r--r--    1 root root  8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject6366__head_1C035804__4
-rw-r--r--    1 root root  8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject7345__head_D4F0F804__4
-rw-r--r--    1 root root  8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject8425__head_C961F404__4
-rw-r--r--    1 root root  8192 Nov 20 03:54 benchmark\udata\umagna003\u23623\uobject9199__head_9C2B4404__4

Mark

On 11/20/2014 09:15 AM, René Gallati wrote:
Hello Mark,

sorry for barging in, but are you sure this is correct? In my tests the
-b parameter in rados bench does exactly one thing: its value is used in
the output to calculate IO bandwidth, by taking the OPS value and
multiplying it by the -b value for display. However, it *always*
performs 4M blocksize operations on the cluster. At least mine does:

rados -v
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

When I use -b 4096 it tells me I have 1.xx MBytes/sec bandwidth; however,
looking at the network interfaces, it pulls in 16 Gbit/sec (on 2x10Gb
LACP), which is in the same ballpark as when I leave the parameter out
and it uses 4M blocksize. Also, the IOPS values it displays are the
same in both cases.

So for me, the -b does nothing useful at all in rados bench except
falsifying the displayed value for bandwidth. I consider -b in rados
bench broken right now.

Example:

1) rados -p rbd_bench bench 10 rand -t 32 --run-name three
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
      0       0         0         0         0         0         -         0
      1      32       552       520   2079.62      2080  0.074611 0.0576271
      2      31       961       930   1859.66      1640  0.035957 0.0664811
      3      32      1451      1419   1891.69      1956  0.029887 0.0664829
      4      31      1984      1953   1952.68      2136  0.070463 0.0643797
      5      31      2505      2474    1978.9      2084  0.253059 0.0639347
      6      31      2971      2940   1959.71      1864  0.032281 0.0634925
      7      31      3463      3432   1960.86      1968   0.03457 0.0647746
      8      32      3966      3934   1966.72      2008  0.037449 0.0643011
      9      32      4362      4330   1924.18      1584  0.043155 0.0661253
     10      32      4882      4850   1939.73      2080  0.034571 0.0655373
  Total time run:        10.105738
Total reads made:     4882
Read size:            4194304
Bandwidth (MB/sec):    1932.368

Average Latency:       0.0661377
Max latency:           0.890115
Min latency:           0.022432

network bandwidth on bond via bwm-ng:
      14.90 Gb/s  RX

2) rados -p rbd_bench bench 10 rand -t 32 -b 4096 --run-name three
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
      0       0         0         0         0         0         -         0
      1      32       523       491   1.91749   1.91797   0.03142 0.0594356
      2      31      1043      1012   1.97591   2.03516   0.05142  0.061328
      3      31      1503      1472   1.91617   1.79688  0.165884  0.064101
      4      32      1998      1966    1.9195   1.92969  0.037625 0.0637315
      5      32      2510      2478   1.93555         2  0.034491 0.0639315
      6      31      2948      2917   1.89873   1.71484  0.033597 0.0652374
      7      32      3446      3414   1.90479   1.94141  0.038067 0.0649422
      8      32      3915      3883   1.89566   1.83203  0.201129 0.0653705
      9      31      4419      4388   1.90419   1.97266  0.079625  0.065216
     10      32      4918      4886   1.90827   1.94531  0.037762 0.0651324
  Total time run:        10.167264
Total reads made:     4918
Read size:            4194304
Bandwidth (MB/sec):    1934.837

Average Latency:       0.0658465
Max latency:           0.788744
Min latency:           0.018118

network bandwidth on bond via bwm-ng:
       15.53 Gb/s RX

So you see, the displayed output varies tremendously, while network and
cluster activity (and the ops column) do not. -b does not work.
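
As a rough cross-check between the bench summary and the interface counters, the summary bandwidth can be converted to Gbit/s. This is only a sketch: it assumes the bench output counts one MB as 2**20 bytes, it ignores protocol overhead, and the unit convention bwm-ng uses is not stated in this thread. The function name is made up for illustration.

```python
def mib_per_s_to_gbit_per_s(mib_per_s):
    """Convert MB/s (taken as 2**20 bytes per second) to decimal Gbit/s."""
    return mib_per_s * 1024 * 1024 * 8 / 1e9


# Summary "Bandwidth (MB/sec)" values of the two runs above.
run_default = mib_per_s_to_gbit_per_s(1932.368)
run_b4096 = mib_per_s_to_gbit_per_s(1934.837)

print(round(run_default, 1), round(run_b4096, 1))
```

Both runs come out at roughly 16 Gbit/s of payload, in the same ballpark as the 14.90 and 15.53 Gb/s seen on the bond, and nowhere near the ~2 MB/s printed per second in the -b 4096 run.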

Examples are on a non-production cluster of 7 servers with SSD OSDs
exclusively, everything on a 2x10GBit LACP bond network.

Kind regards

René

On 20.11.2014 15:15, Mark Nelson wrote:
Hi Jay,

The -b parameter to rados bench controls the size of the objects being
written. Previously you were writing out 8KB objects, which behind the
scenes translates into writing out lots of small files on the OSDs.
Your dd tests were doing 1MB writes, which are much larger and more
sequential in nature (i.e. less moving the heads of the hard drives
around on the platters, so more time can be spent writing data out).
The new rados bench test I had you run wrote out 4MB objects, which are
larger yet and roughly the sweet spot as far as Ceph is concerned for
sequential reads/writes.
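
To put numbers on the difference, here is a small Python sketch. It assumes one MB is 2**20 bytes, takes the roughly 13.5 MB/s and 583.652 MB/s write figures from the reports in this thread, and uses helper names that are purely illustrative:

```python
MIB = 1024 * 1024  # assumption: "MB" in the bench output means 2**20 bytes


def objects_needed(total_bytes, object_bytes):
    """Number of objects needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // object_bytes)


def ops_per_second(mb_per_s, object_bytes):
    """Object writes per second implied by a bandwidth figure."""
    return mb_per_s * MIB / object_bytes


one_gib = 1024 * MIB
print(objects_needed(one_gib, 8 * 1024))        # 131072 objects of 8KB
print(objects_needed(one_gib, 4 * MIB))         # 256 objects of 4MB
print(round(ops_per_second(13.5, 8 * 1024)))    # 1728 writes/s at 8KB
print(round(ops_per_second(583.652, 4 * MIB)))  # 146 writes/s at 4MB
```

So the 8KB run was completing many more operations per second than the 4MB run, but each one carried very little data.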

Hope this helps!

Mark

On 11/20/2014 07:31 AM, Jay Janardhan wrote:
Hi Mark,

The results are below. These numbers look good but I'm not really sure
what to conclude now.

# rados -p performance_test bench 120 write -b 4194304 -t 100 --no-cleanup

Total time run:         120.133251
Total writes made:      17529
Write size:             4194304
Bandwidth (MB/sec):     583.652
Stddev Bandwidth:       269.76
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 0
Average Latency:        0.68418
Stddev Latency:         0.552344
Max latency:            5.06959
Min latency:            0.121746

# rados -p performance_test bench 120 seq -b 4194304 -t 100

Total time run:        58.451831
Total reads made:     17529
Read size:            4194304
Bandwidth (MB/sec):    1199.552
Average Latency:       0.332538
Max latency:           3.72943
Min latency:           0.007074
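
For anyone who wants to sanity-check these summaries, the bandwidth line appears to be total operations times operation size divided by run time. A minimal Python sketch under that assumption, with one MB taken as 2**20 bytes and a function name chosen only for illustration:

```python
MIB = 1024 * 1024  # assumption: "MB" in the bench output means 2**20 bytes


def summary_bandwidth(total_ops, op_bytes, seconds):
    """Average bandwidth in MB/s over a whole bench run."""
    return total_ops * op_bytes / MIB / seconds


# Totals taken from the write and seq summaries above.
write_bw = summary_bandwidth(17529, 4194304, 120.133251)
read_bw = summary_bandwidth(17529, 4194304, 58.451831)

print(round(write_bw, 1), round(read_bw, 1))
```

Both values land within rounding of the reported 583.652 and 1199.552 MB/sec.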

On Wed, Nov 19, 2014 at 8:55 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

    On 11/19/2014 06:51 PM, Jay Janardhan wrote:

        Can someone help me figure out what I can tune to improve the
        performance? The cluster is pushing data at about 13 MB/s with a
        single copy of data, while the underlying disks can push 100+ MB/s.

        Can anyone help me with this?

        *rados bench results:*

        Concurrency  Replication size  Write (MB/s)  Seq Read (MB/s)
        32           1                 13.5          32.8
        32           2                 12.7          32.0
        32           3                 6.1           30.2

        *Commands I used (Pool size was updated appropriately):*

        rados -p performance_test bench 120 write -b 8192 -t 100 --no-cleanup
        rados -p performance_test bench 120 seq -t 100

    How's performance if you do:

    rados -p performance_test bench 120 write -b 4194304 -t 100 --no-cleanup

    and

    rados -p performance_test bench 120 seq -b 4194304 -t 100

    instead?

    Mark

        1) *Disk tests - All have similar numbers:*
        # dd if=/dev/zero of=here bs=1G count=1 oflag=direct
        1+0 records in
        1+0 records out
        1073741824 bytes (1.1 GB) copied, 10.0691 s, 107 MB/s

        2) *10G network is not holding things up*
        # iperf -c 10.13.10.15  -i2 -t 10
        ------------------------------------------------------------
        Client connecting to 10.13.10.15, TCP port 5001
        TCP window size: 85.0 KByte (default)
        ------------------------------------------------------------
        [  3] local 10.13.30.13 port 56459 connected with 10.13.10.15 port 5001
        [ ID] Interval       Transfer     Bandwidth
        [  3]  0.0- 2.0 sec  2.17 GBytes  9.33 Gbits/sec
        [  3]  2.0- 4.0 sec  2.18 GBytes  9.37 Gbits/sec
        [  3]  4.0- 6.0 sec  2.18 GBytes  9.37 Gbits/sec
        [  3]  6.0- 8.0 sec  2.18 GBytes  9.38 Gbits/sec
        [  3]  8.0-10.0 sec  2.18 GBytes  9.37 Gbits/sec
        [  3]  0.0-10.0 sec  10.9 GBytes  9.36 Gbits/sec

        *3) Ceph Status*

        # ceph health
        HEALTH_OK
        root@us1-r04u05s01-ceph:~# ceph status
              cluster 5e95b6fa-0b99-4c31-8aa9-7a88h6hc5eda
               health HEALTH_OK
               monmap e4: 4 mons at {us1-r01u05s01-ceph=10.1.30.10:6789/0,us1-r01u09s01-ceph=10.1.30.11:6789/0,us1-r04u05s01-ceph=10.1.30.14:6789/0,us1-r04u09s01-ceph=10.1.30.15:6789/0}, election epoch 78, quorum 0,1,2,3 us1-r01u05s01-ceph,us1-r01u09s01-ceph,us1-r04u05s01-ceph,us1-r04u09s01-ceph
               osdmap e1029: 97 osds: 97 up, 97 in
                pgmap v1850869: 12480 pgs, 6 pools, 587 GB data, 116 kobjects
                      1787 GB used, 318 TB / 320 TB avail
                         12480 active+clean
            client io 0 B/s rd, 25460 B/s wr, 20 op/s

        *4) Ceph configuration*

        # cat ceph.conf

        [global]
            auth cluster required = cephx
            auth service required = cephx
            auth client required = cephx
            cephx require signatures = True
            cephx cluster require signatures = True
            cephx service require signatures = False
            fsid = 5e95b6fa-0b99-4c31-8aa9-7a88h6hc5eda
            osd pool default pg num = 4096
            osd pool default pgp num = 4096
            osd pool default size = 3
            osd pool default min size = 1
            osd pool default crush rule = 0
            # Disable in-memory logs
            debug_lockdep = 0/0
            debug_context = 0/0
            debug_crush = 0/0
            debug_buffer = 0/0
            debug_timer = 0/0
            debug_filer = 0/0
            debug_objecter = 0/0
            debug_rados = 0/0
            debug_rbd = 0/0
            debug_journaler = 0/0
            debug_objectcatcher = 0/0
            debug_client = 0/0
            debug_osd = 0/0
            debug_optracker = 0/0
            debug_objclass = 0/0
            debug_filestore = 0/0
            debug_journal = 0/0
            debug_ms = 0/0
            debug_monc = 0/0
            debug_tp = 0/0
            debug_auth = 0/0
            debug_finisher = 0/0
            debug_heartbeatmap = 0/0
            debug_perfcounter = 0/0
            debug_asok = 0/0
            debug_throttle = 0/0
            debug_mon = 0/0
            debug_paxos = 0/0
            debug_rgw = 0/0

        [mon]
            mon osd down out interval = 600
            mon osd min down reporters = 2
        [mon.us1-r01u05s01-ceph]
            host = us1-r01u05s01-ceph
            mon addr = 10.1.30.10
        [mon.us1-r01u09s01-ceph]
            host = us1-r01u09s01-ceph
            mon addr = 10.1.30.11
        [mon.us1-r04u05s01-ceph]
            host = us1-r04u05s01-ceph
            mon addr = 10.1.30.14
        [mon.us1-r04u09s01-ceph]
            host = us1-r04u09s01-ceph
            mon addr = 10.1.30.15
        [osd]
            osd mkfs type = xfs
            osd mkfs options xfs = -f -i size=2048
            osd mount options xfs = noatime
            osd journal size = 10000
            cluster_network = 10.2.0.0/16
            public_network = 10.1.0.0/16
            osd mon heartbeat interval = 30
            # Performance tuning
            filestore merge threshold = 40
            filestore split multiple = 8
            osd op threads = 8
            filestore op threads = 8
            filestore max sync interval = 5
            osd max scrubs = 1
            # Recovery tuning
            osd recovery max active = 5
            osd max backfills = 2
            osd recovery op priority = 2
            osd recovery max chunk = 8388608
            osd recovery threads = 1
            osd objectstore = filestore
            osd crush update on start = true

        [mds]

        *5) Ceph OSDs/Crushmap*

        # ceph osd tree
        # id    weight  type name                               up/down  reweight
        -14     5.46    root fusion_drives
        -11     5.46      rack rack01-fusion
        -7      2.73        host us1-r01u25s01-compf-fusion
        89      2.73          osd.89                            up       1
        -9      2.73        host us1-r01u23s01-compf-fusion
        96      2.73          osd.96                            up       1
        -13     315.2   root sata_drives
        -10     166       rack rack01-sata
        -2      76.44       host us1-r01u05s01-ceph
        0       3.64          osd.0                             up       1
        1       3.64          osd.1                             up       1
        2       3.64          osd.2                             up       1
        3       3.64          osd.3                             up       1
        4       3.64          osd.4                             up       1
        5       3.64          osd.5                             up       1
        6       3.64          osd.6                             up       1
        7       3.64          osd.7                             up       1
        8       3.64          osd.8                             up       1
        9       3.64          osd.9                             up       1
        10      3.64          osd.10                            up       1
        11      3.64          osd.11                            up       1
        12      3.64          osd.12                            up       1
        13      3.64          osd.13                            up       1
        14      3.64          osd.14                            up       1
        15      3.64          osd.15                            up       1
        16      3.64          osd.16                            up       1
        17      3.64          osd.17                            up       1
        18      3.64          osd.18                            up       1
        19      3.64          osd.19                            up       1
        20      3.64          osd.20                            up       1
        -3      76.44       host us1-r01u09s01-ceph
        21      3.64          osd.21                            up       1
        22      3.64          osd.22                            up       1
        23      3.64          osd.23                            up       1
        24      3.64          osd.24                            up       1
        25      3.64          osd.25                            up       1
        26      3.64          osd.26                            up       1
        27      3.64          osd.27                            up       1
        28      3.64          osd.28                            up       1
        29      3.64          osd.29                            up       1
        30      3.64          osd.30                            up       1
        31      3.64          osd.31                            up       1
        32      3.64          osd.32                            up       1
        33      3.64          osd.33                            up       1
        34      3.64          osd.34                            up       1
        35      3.64          osd.35                            up       1
        36      3.64          osd.36                            up       1
        37      3.64          osd.37                            up       1
        38      3.64          osd.38                            up       1
        39      3.64          osd.39                            up       1
        40      3.64          osd.40                            up       1
        41      3.64          osd.41                            up       1
        -6      6.54        host us1-r01u25s01-compf-sata
        83      1.09          osd.83                            up       1
        84      1.09          osd.84                            up       1
        85      1.09          osd.85                            up       1
        86      1.09          osd.86                            up       1
        87      1.09          osd.87                            up       1
        88      1.09          osd.88                            up       1
        -8      6.54        host us1-r01u23s01-compf-sata
        90      1.09          osd.90                            up       1
        91      1.09          osd.91                            up       1
        92      1.09          osd.92                            up       1
        93      1.09          osd.93                            up       1
        94      1.09          osd.94                            up       1
        95      1.09          osd.95                            up       1
        -12     149.2     rack rack04-sata
        -4      72.8        host us1-r04u05s01-ceph
        42      3.64          osd.42                            up       1
        43      3.64          osd.43                            up       1
        44      3.64          osd.44                            up       1
        45      3.64          osd.45                            up       1
        46      3.64          osd.46                            up       1
        47      3.64          osd.47                            up       1
        48      3.64          osd.48                            up       1
        49      3.64          osd.49                            up       1
        50      3.64          osd.50                            up       1
        51      3.64          osd.51                            up       1
        52      3.64          osd.52                            up       1
        53      3.64          osd.53                            up       1
        54      3.64          osd.54                            up       1
        55      3.64          osd.55                            up       1
        56      3.64          osd.56                            up       1
        57      3.64          osd.57                            up       1
        58      3.64          osd.58                            up       1
        59      3.64          osd.59                            up       1
        60      3.64          osd.60                            up       1
        61      3.64          osd.61                            up       1
        -5      76.44       host us1-r04u09s01-ceph
        62      3.64          osd.62                            up       1
        63      3.64          osd.63                            up       1
        64      3.64          osd.64                            up       1
        65      3.64          osd.65                            up       1
        66      3.64          osd.66                            up       1
        67      3.64          osd.67                            up       1
        68      3.64          osd.68                            up       1
        69      3.64          osd.69                            up       1
        70      3.64          osd.70                            up       1
        71      3.64          osd.71                            up       1
        72      3.64          osd.72                            up       1
        73      3.64          osd.73                            up       1
        74      3.64          osd.74                            up       1
        75      3.64          osd.75                            up       1
        76      3.64          osd.76                            up       1
        77      3.64          osd.77                            up       1
        78      3.64          osd.78                            up       1
        79      3.64          osd.79                            up       1
        80      3.64          osd.80                            up       1
        81      3.64          osd.81                            up       1
        82      3.64          osd.82                            up       1

        *6) OSDs from one of the cluster nodes (rest are similar)*

        /dev/sda1                   3905109820 16741944 3888367876   1% /var/lib/ceph/osd/ceph-42
        /dev/sdb1                   3905109820 19553976 3885555844   1% /var/lib/ceph/osd/ceph-43
        /dev/sdc1                   3905109820 18081680 3887028140   1% /var/lib/ceph/osd/ceph-44
        /dev/sdd1                   3905109820 19070596 3886039224   1% /var/lib/ceph/osd/ceph-45
        /dev/sde1                   3905109820 17949284 3887160536   1% /var/lib/ceph/osd/ceph-46
        /dev/sdf1                   3905109820 18538344 3886571476   1% /var/lib/ceph/osd/ceph-47
        /dev/sdg1                   3905109820 17792608 3887317212   1% /var/lib/ceph/osd/ceph-48
        /dev/sdh1                   3905109820 20910976 3884198844   1% /var/lib/ceph/osd/ceph-49
        /dev/sdi1                   3905109820 19683208 3885426612   1% /var/lib/ceph/osd/ceph-50
        /dev/sdj1                   3905109820 20115236 3884994584   1% /var/lib/ceph/osd/ceph-51
        /dev/sdk1                   3905109820 19152812 3885957008   1% /var/lib/ceph/osd/ceph-52
        /dev/sdm1                   3905109820 18701728 3886408092   1% /var/lib/ceph/osd/ceph-53
        /dev/sdn1                   3905109820 19603536 3885506284   1% /var/lib/ceph/osd/ceph-54
        /dev/sdo1                   3905109820 20164928 3884944892   1% /var/lib/ceph/osd/ceph-55
        /dev/sdp1                   3905109820 19093024 3886016796   1% /var/lib/ceph/osd/ceph-56
        /dev/sdq1                   3905109820 18699344 3886410476   1% /var/lib/ceph/osd/ceph-57
        /dev/sdr1                   3905109820 19267068 3885842752   1% /var/lib/ceph/osd/ceph-58
        /dev/sds1                   3905109820 19745212 3885364608   1% /var/lib/ceph/osd/ceph-59
        /dev/sdt1                   3905109820 16321696 3888788124   1% /var/lib/ceph/osd/ceph-60
        /dev/sdu1                   3905109820 19154884 3885954936   1% /var/lib/ceph/osd/ceph-61

        *7) Journal Files (there are TWO SSDs)*
        # parted /dev/sdy print
        Model: ATA SanDisk SD7UB2Q5 (scsi)
        Disk /dev/sdy: 512GB
        Sector size (logical/physical): 512B/4096B
        Partition Table: gpt

        Number  Start   End     Size    File system  Name          Flags
           1      1049kB  10.5GB  10.5GB               ceph journal
           2      10.5GB  21.0GB  10.5GB               ceph journal
           3      21.0GB  31.5GB  10.5GB               ceph journal
           4      31.5GB  41.9GB  10.5GB               ceph journal
           5      41.9GB  52.4GB  10.5GB               ceph journal
           6      52.4GB  62.9GB  10.5GB               ceph journal
           7      62.9GB  73.4GB  10.5GB               ceph journal

        _______________________________________________
        ceph-users mailing list
        ceph-users@xxxxxxxxxxxxxx
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
