It's always possible it was the reboot (seriously!) :)

Mark

On 09/03/2015 12:16 PM, Ian Colle wrote:
Am I the only one who finds it funny that the "ceph problem" was fixed by an update to the disk controller firmware? :-)

Ian

On Thu, Sep 3, 2015 at 11:13 AM, Vickey Singh <vickey.singh22693@xxxxxxxxx> wrote:

Hey Mark / Community,

This is the sequence of changes that seems to have fixed the Ceph problem:

1# Upgrading the disk controller firmware from 6.34 to 6.64 (the latest)
2# Rebooting all nodes so the new firmware takes effect

Read and write operations are now normal, and so are system load and CPU utilization.

- Vickey -

On Wed, Sep 2, 2015 at 11:28 PM, Vickey Singh <vickey.singh22693@xxxxxxxxx> wrote:

Thank you Mark, please see my responses below.

On Wed, Sep 2, 2015 at 5:23 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 09/02/2015 08:51 AM, Vickey Singh wrote:

Hello Ceph Experts,

I have a strange problem: when I read from or write to a Ceph pool, throughput is not steady. Please notice the "cur MB/s" column below, which keeps going up and down.

-- Ceph Hammer 0.94.2
-- CentOS 6 (2.6 kernel)
-- The Ceph cluster is healthy

Mark Nelson wrote:
You might find that CentOS 7 gives you better performance. In some cases we were seeing nearly 2X.

Vickey Singh wrote:
Wooo, 2X! I would definitely plan an upgrade. Thanks.

One interesting thing is that whenever I start a rados bench command for read or write, CPU idle % drops to about 10 and system load climbs sharply.

Hardware: HP SL4540

Mark Nelson wrote:
Please make sure the controller is on the newest firmware. There used to be a bug that would cause sequential write performance to bottleneck when writeback cache was enabled on the RAID controller.

Vickey Singh wrote:
Last month I upgraded the firmware on this hardware, so I hope it is up to date.

32-core CPU, 196 GB memory, 10G network

Mark Nelson wrote:
Be sure to check the network too. We've seen a lot of cases where folks have been burned by one of the NICs acting funky.

Vickey Singh wrote:
At first glance the interfaces look good and they are pushing data nicely (whatever they are given). I don't think hardware is the problem.

Please give me clues / pointers on how I should troubleshoot this problem.
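To double-check the two controller-related points above (firmware level and the write-back cache state Mark mentions), something along the following lines can be run on each OSD node. This is only a sketch: it assumes HP's hpssacli utility is installed (older systems may ship hpacucli instead), and the slot number is just an example.

  # Controller details, including firmware version and cache configuration
  # (tool name and slot number are assumptions; adjust for your setup)
  hpssacli ctrl all show detail | grep -iE 'firmware|cache'

  # Per-logical-drive view, including whether write-back caching is enabled
  hpssacli ctrl slot=0 ld all show detail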
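Similarly, for Mark's point about a single NIC acting funky, a minimal per-node sanity check might look like this; eth0 and the iperf3 endpoints are placeholders for the actual 10G interface and hosts.

  # Link speed/duplex and driver-level error counters
  ethtool eth0
  ethtool -S eth0 | grep -iE 'err|drop|fifo'

  # Kernel-level RX/TX error and drop counters
  ip -s link show eth0

  # Raw node-to-node throughput, if iperf3 is installed:
  #   on one node:      iperf3 -s
  #   on another node:  iperf3 -c <server-ip> -P 4 -t 30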
# rados bench -p glance-test 60 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
 Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        20         4     15.99        16   0.12308   0.10001
    2      16        37        21   41.9841        68   1.79104  0.827021
    3      16        68        52   69.3122       124  0.084304  0.854829
    4      16       114        98   97.9746       184   0.12285  0.614507
    5      16       188       172   137.568       296  0.210669  0.449784
    6      16       248       232   154.634       240  0.090418  0.390647
    7      16       305       289    165.11       228  0.069769  0.347957
    8      16       331       315   157.471       104  0.026247    0.3345
    9      16       361       345   153.306       120  0.082861  0.320711
   10      16       380       364   145.575        76  0.027964  0.310004
   11      16       393       377   137.067        52   3.73332  0.393318
   12      16       448       432   143.971       220  0.334664  0.415606
   13      16       476       460   141.508       112  0.271096  0.406574
   14      16       497       481   137.399        84  0.257794  0.412006
   15      16       507       491   130.906        40   1.49351  0.428057
   16      16       529       513   115.042        88  0.399384   0.48009
   17      16       533       517   94.6286        16   5.50641  0.507804
   18      16       537       521    83.405        16   4.42682  0.549951
   19      16       538       522    80.349         4   11.2052  0.570363
2015-09-02 09:26:18.398641 min lat: 0.023851 max lat: 11.2052 avg lat: 0.570363
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20      16       538       522   77.3611         0         -  0.570363
   21      16       540       524   74.8825         4   8.88847  0.591767
   22      16       542       526   72.5748         8   1.41627  0.593555
   23      16       543       527   70.2873         4    8.0856  0.607771
   24      16       555       539   69.5674        48  0.145199  0.781685
   25      16       560       544   68.0177        20    1.4342  0.787017
   26      16       564       548   66.4241        16  0.451905   0.78765
   27      16       566       550   64.7055         8  0.611129  0.787898
   28      16       570       554   63.3138        16   2.51086  0.797067
   29      16       570       554   61.5549         0         -  0.797067
   30      16       572       556   60.1071         4   7.71382  0.830697
   31      16       577       561   59.0515        20   23.3501  0.916368
   32      16       590       574   58.8705        52  0.336684  0.956958
   33      16       591       575   57.4986         4   1.92811  0.958647
   34      16       591       575   56.0961         0         -  0.958647
   35      16       591       575   54.7603         0         -  0.958647
   36      16       597       581   54.0447         8  0.187351   1.00313
   37      16       625       609   52.8394       112   2.12256   1.09256
   38      16       631       615    52.227        24   1.57413   1.10206
   39      16       638       622   51.7232        28   4.41663   1.15086
2015-09-02 09:26:40.510623 min lat: 0.023851 max lat: 27.6704 avg lat: 1.15657
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   40      16       652       636   51.8102        56  0.113345   1.15657
   41      16       682       666   53.1443       120  0.041251   1.17813
   42      16       685       669   52.3395        12  0.501285   1.17421
   43      15       690       675   51.7955        24   2.26605   1.18357
   44      16       728       712   53.6062       148  0.589826   1.17478
   45      16       728       712   52.6158         0         -   1.17478
   46      16       728       712   51.6613         0         -   1.17478
   47      16       728       712   50.7407         0         -   1.17478
   48      16       772       756   52.9332        44  0.234811    1.1946
   49      16       835       819   56.3577       252   5.67087   1.12063
   50      16       890       874   59.1252       220  0.230806   1.06778
   51      16       896       880   58.5409        24  0.382471   1.06121
   52      16       896       880   57.5832         0         -   1.06121
   53      16       896       880   56.6562         0         -   1.06121
   54      16       896       880   55.7587         0         -   1.06121
   55      16       897       881   54.9515         1   4.88333   1.06554
   56      16       897       881   54.1077         0         -   1.06554
   57      16       897       881   53.2894         0         -   1.06554
   58      16       897       881   51.9335         0         -   1.06554
   59      16       897       881   51.1792         0         -   1.06554
2015-09-02 09:27:01.267301 min lat: 0.01405 max lat: 27.6704 avg lat: 1.06554
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   60      16       897       881   50.4445         0         -   1.06554

    cluster 98d89661-f616-49eb-9ccf-84d720e179c0
     health HEALTH_OK
     monmap e3: 3 mons at {s01=10.100.50.1:6789/0,s02=10.100.50.2:6789/0,s03=10.100.50.3:6789/0}, election epoch 666, quorum 0,1,2 s01,s02,s03
     osdmap e121039: 240 osds: 240 up, 240 in
      pgmap v850698: 7232 pgs, 31 pools, 439 GB data, 43090 kobjects
            2635 GB used, 867 TB / 870 TB avail
                7226 active+clean
                   6 active+clean+scrubbing+deep

Mark Nelson wrote:
Note the last line there. You'll likely want to try your test again when scrubbing is complete.

Vickey Singh wrote:
Yeah, I have tried it a few times when the cluster is perfectly healthy (no scrubbing / repairs going on).

Mark Nelson wrote:
Also, you may want to try this script:

https://github.com/ceph/cbt/blob/master/tools/readpgdump.py

You can invoke it like:

ceph pg dump | ./readpgdump.py

That will give you a bunch of information about the pools on your system. I'm a little concerned about how many PGs your glance-test pool may have given your totals above.

Vickey Singh wrote:
Thanks for the link, I will do that, and I will also run rados bench against the other pools (where the PG count is higher).

Now here are some of my observations:

1# When the cluster is not doing anything (HEALTH_OK, no background scrubbing / repair) and all system resources (CPU / memory / network) are mostly idle, and I then start rados bench (write / rand / seq), the rados bench output suddenly drops after a few seconds from ~500 MB/s to a few tens of MB/s. At the same time the CPUs are about 90% busy and the system load jumps up. A few minutes after rados bench completes, the system resources become idle again.

2# Sometimes a few PGs become unclean for a few minutes while rados bench is running, and then they quickly go back to active+clean.

I am out of clues, so any help from the community that points me in the right direction would be appreciated.

- Vickey -

--
Ian R. Colle
Global Director of Software Engineering
Red Hat, Inc.
icolle@xxxxxxxxxx
+1-303-601-7713
http://www.linkedin.com/in/ircolle
http://www.twitter.com/ircolle
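On Mark's concern about the PG count of the glance-test pool: besides readpgdump.py, the per-pool pg_num values can be read directly from the cluster. A small sketch, with glance-test used only because it is the pool named in this thread:

  # pg_num / pgp_num for every pool
  ceph osd dump | grep '^pool'

  # A single pool's PG count and its share of the data
  ceph osd pool get glance-test pg_num
  ceph df

  # Mark's script, for a per-pool PG breakdown
  ceph pg dump | ./readpgdump.py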
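For benchmarking the other pools as mentioned above, the objects written by the write phase have to be kept around for the seq/rand read phases. A sketch, with some-pool as a placeholder pool name:

  # Write phase; --no-cleanup keeps the benchmark objects for the read tests
  rados bench -p some-pool 60 write --no-cleanup

  # Sequential and random read phases against the objects written above
  rados bench -p some-pool 60 seq
  rados bench -p some-pool 60 rand

  # Remove the benchmark objects afterwards
  rados -p some-pool cleanup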
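To narrow down the load/CPU spikes from observation 1# and the briefly unclean PGs from observation 2#, it helps to watch the OSD nodes and the cluster while the bench is running. A sketch using commonly available tools (the sysstat package is assumed for iostat/sar):

  # Per-disk utilization and request latency on the OSD nodes, every 5 seconds
  iostat -x 5

  # Run-queue length / load average over time
  sar -q 5

  # Per-OSD commit/apply latency as seen by Ceph; a single slow OSD stands out here
  ceph osd perf

  # PG states while the bench is running
  ceph health detail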
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com