Re: ceph cluster having blocked requests very frequently

Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> · Tue, 15 Nov 2016 23:28:41 +0100



    On 11/15/16 22:13, Thomas Danan wrote:

    
        Very interesting...

        Any idea why optimal tunables would help here?
      
    
    I think some tunables versions rebalance a lot of data to even
    things out... though I can't remember where I read that. Maybe it
    was only argonaut vs. newer. But having to rebalance 75% of the
    data makes me feel more confident. (Keep in mind that it
    significantly changes client version compatibility requirements,
    especially for kernel drivers; a compatible one possibly doesn't
    exist in any version.)

    
    And looking at iostat etc. at the times when requests block, it
    seems like 1-2 disks are at 100% util and the rest are nearly idle,
    and the SSD journals rarely go above 10% or so (I bought 2
    expensive Micron DC ones per node). So I think balance is the most
    important thing I need, and plain efficiency is the next thing
    (which might come from bluestore when it's ready, especially
    related to rbd snapshot CoW). Having 2 disks at 100% is something
    like 300-560 iops, where the whole server ought to do about 1700
    iops (3 disks that do about 280 and 6 more that do about 150 direct
    sync rand write 4k iops per node). That's about 21% utilization
    before it blocks.
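The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the per-disk figures quoted in this message; the function names are made up for the example. It gives a node capacity of 1740 iops (rounded to "about 1700" above) and a delivered share of roughly 17% to 32% when only two disks are busy, which brackets the ~21% estimate.

```python
# Approximate 4k direct sync random-write IOPS per spinning disk in one node,
# as quoted in the message: 3 disks at ~280 and 6 disks at ~150.
DISKS = [280] * 3 + [150] * 6


def node_capacity(disks):
    """Total IOPS if every disk in the node could be kept busy at once."""
    return sum(disks)


def saturated_share(disks, n_busy):
    """Return (low, high) fraction of node capacity delivered when only
    n_busy disks are at 100% and all the others are idle."""
    ordered = sorted(disks)
    total = node_capacity(disks)
    low = sum(ordered[:n_busy]) / total    # the slowest disks are the busy ones
    high = sum(ordered[-n_busy:]) / total  # the fastest disks are the busy ones
    return low, high


total = node_capacity(DISKS)
low, high = saturated_share(DISKS, 2)
print(f"node capacity: {total} iops")
print(f"2 busy disks deliver {low:.0%} to {high:.0%} of that")
```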

    
    You could try getting the data out of my ganglia here (sda,sdb are
    the SSDs, and ceph2 sdg is broken and missing with bogus data on the
    graphs):

http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=ceph.*&mreg%5B%5D=sd%5Bc-z%5D_util&gtype=line&glegend=show&aggregate=1

    
    But it's not that easy to get this info out of ganglia... highly
    customized graphing isn't the best.

    
          On our cluster we have 500TB of data, so I am a bit
          concerned about changing it without taking a lot of
          precautions...
      
    
    I can't guarantee a bug-free experience, but you can change it,
    look at the percentage of objects being rebalanced, and if you
    don't like it, change it back (maybe it will be much less going
    from firefly to hammer than to jewel, as I did). But if you wait an
    hour before changing it back, you can bet it takes an hour to
    settle again (or set nobackfill, maybe). I don't like this, but I
    don't know what to do other than rate limit it and accept the
    enormous wait.
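A minimal sketch of that "change it, look, change it back" sequence, written as a Python helper that only builds the command lines and runs nothing. The function names are hypothetical. The "firefly" default for the revert is the profile Thomas reported running, and setting nobackfill before the change is just one reading of the aside above, so treat both as assumptions.

```python
import shlex


def apply_steps(new_profile="optimal", pause_backfill=False):
    """Commands to switch CRUSH tunables and then inspect the result."""
    steps = [["ceph", "osd", "crush", "show-tunables"]]  # record the current state
    if pause_backfill:
        # Assumption: hold data movement while looking at the numbers.
        steps.append(["ceph", "osd", "set", "nobackfill"])
    steps.append(["ceph", "osd", "crush", "tunables", new_profile])
    steps.append(["ceph", "-s"])  # cluster status, including rebalancing progress
    return steps


def revert_steps(old_profile="firefly", pause_backfill=False):
    """Commands to go back to the previous tunables profile."""
    steps = [["ceph", "osd", "crush", "tunables", old_profile]]
    if pause_backfill:
        steps.append(["ceph", "osd", "unset", "nobackfill"])
    return steps


for cmd in apply_steps(pause_backfill=True) + revert_steps(pause_backfill=True):
    print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on a real cluster each line would be reviewed and run by hand.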

    
        I am curious to know how much time it takes you to change
          tunables, the size of your cluster, and the observed impact
          on client IO...
      
    
    Well... it was at 77.65% or so (tunables made it 75%, plus more
    PGs), and now, after almost 3 hours, it's at 75.141%... so I
    imagine it'll take somewhere between 75 hours and forever minus a
    day or two. But with the sleep settings, it doesn't seem to cause
    any issues. So if there's any chance of it balancing out the load
    on the OSDs, I'll try it. (And these numbers are with me fiddling
    with it and watching it every now and then... I'll set max
    backfills back to 1 and sleep back to about 0.6 when I go to bed...
    maybe then it'll run at half speed.)
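For a rough idea of where that estimate comes from, here is a naive linear extrapolation from the two percentages above. It assumes the rate stays constant, which it won't once the backfill and sleep settings are changed, so it is a ballpark only; the function name is made up for the example. With these inputs it comes out at about 90 hours.

```python
def hours_remaining(start_pct, now_pct, elapsed_hours):
    """Linear estimate of the hours left until the rebalancing percentage
    reaches 0, given how far it dropped over elapsed_hours."""
    rate = (start_pct - now_pct) / elapsed_hours  # percentage points per hour
    if rate <= 0:
        return float("inf")  # no progress observed, so no finite estimate
    return now_pct / rate


eta = hours_remaining(start_pct=77.65, now_pct=75.141, elapsed_hours=3.0)
print(f"about {eta:.0f} hours at the current rate")
```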

    
    Also, FYI, I only have 31% space used (most of the disks I added
    were to make it not horribly slow rather than to add space, since
    it was so slow with only 3 disks per node).

    
    The cluster is just 3 nodes, with 2x Micron S630DC-400, 3x
    HUS724040ALS640, and 6 x Hitachi HUA722020ALA330 (minus one dead
    one) (last one is SATA... just some old stuff I added to speed
    things up, which helped even though they're slower).

    
    # ceph df
      GLOBAL:
          SIZE       AVAIL      RAW USED     %RAW USED
          65173G     45110G       20063G         30.78

    
    And as for impact... I could tell you more tomorrow. But with the
    sleep settings, the 4k randwrite iops in fio benchmarks seem to be
    maybe half of, or the same as, before, and other behavior doesn't
    seem so bad... maybe even better than before on average, with a
    few more hiccups but less of the blocking that kills qemu VMs
    (which I can't explain... do tunables do that right away? Or did
    the snap trim sleep do something? I doubt the recovery one did,
    since there was no recovery until I decided to change things. Or
    it's just luck so far, and tomorrow morning some VMs will be dead
    and need SIGKILL, like every morning for the past week).

    
        Yes, we do have daily rbd snapshots from 16 different ceph
          RBD clients. Snapshotting the RBD image is nearly immediate,
          while we are seeing the issue continuously during the day...

        Will check all of this tomorrow...
        

        Thanks again
        

        Thomas
        

        
        
        -------- Original message --------
        From: Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx>
        Date: 11/15/16 21:27 (GMT+01:00)
        To: Thomas Danan <Thomas.Danan@xxxxxxxxxxxxx>
        Cc: ceph-users@xxxxxxxxxxxxxx
        Subject: Re: ceph cluster having blocked requests very frequently
        
          On 11/15/16 14:05, Thomas Danan wrote:

            > Hi Peter,
            >
            > Ceph cluster version is 0.94.5 and we are running with
            > Firefly tunables, and we also have 10K PGs instead of the
            > 30K / 40K we should have.
            > The Linux kernel version is 3.10.0-327.36.1.el7.x86_64
            > with RHEL 7.2.
            >
            > On our side we have the following settings:
            > mon_osd_adjust_heartbeat_grace = false
            > mon_osd_adjust_down_out_interval = false
            > mon_osd_min_down_reporters = 5
            > mon_osd_min_down_reports = 10
            >
            > explaining why the OSDs are not flapping, but they are
            > still behaving wrongly and generating the slow requests I
            > am describing.
            >
            > The osd_op_complaint_time is at the default value (30
            > sec); not sure I want to change it based on your
            > experience.

            I wasn't saying you should set the complaint time to 5, just
            saying that's why I have complaints logged with such low
            block times.

            > Thomas

            
            And now I'm testing this:

                    osd recovery sleep = 0.5
                    osd snap trim sleep = 0.5

            
            (or fiddling with it as low as 0.1 to make it rebalance
            faster)
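For fiddling with those two values without restarting OSDs, the usual route is `ceph tell osd.* injectargs`. Below is a small sketch that only builds that command line from the option names above; the helper name is hypothetical, and whether a given option actually takes effect at runtime depends on the Ceph version, so that part is an assumption.

```python
import shlex


def injectargs_command(settings, target="osd.*"):
    """Build a 'ceph tell <target> injectargs' command from ceph.conf-style
    option names (spaces become underscores)."""
    args = " ".join(
        f"--{name.replace(' ', '_')} {value}" for name, value in settings.items()
    )
    return ["ceph", "tell", target, "injectargs", args]


cmd = injectargs_command({"osd recovery sleep": 0.5, "osd snap trim sleep": 0.5})
print(shlex.join(cmd))  # shell-quoted, ready to paste
```

Dropping the values to 0.1 to rebalance faster, or raising them again overnight, is then just a different dictionary passed to the same helper.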

            
            While also changing tunables to optimal (which will
            rebalance 75% of the objects).

            This has had very good results so far (a few <14s blocks
            right at the start, and none since, over an hour ago).

            And I'm somehow hoping that will fix my rbd export-diff
            issue too... but it at least appears to fix the blocking
            caused by rebalancing.

            Do you use rbd snapshots? I think they may be causing my
            issues, based on things like:

            
            >             "description": "osd_op(client.692201.0:20455419 4.1b5a5bc1 rbd_data.94a08238e1f29.000000000000617b [] snapc 918d=[918d] ack+ondisk+write+known_if_redirected e40036)",
            >             "initiated_at": "2016-11-15 20:57:48.313432",
            >             "age": 409.634862,
            >             "duration": 3.377347,
            >             ...
            >                     {
            >                         "time": "2016-11-15 20:57:48.313767",
            >                         "event": "waiting for subops from 0,1,8,22"
            >                     },
            >             ...
            >                     {
            >                         "time": "2016-11-15 20:57:51.688530",
            >                         "event": "sub_op_applied_rec from 22"
            >                     },

            
            Which says "snapc" in there (CoW?), and I think shows that
            just one osd is delayed a few seconds and the rest are
            really fast, like you said. (And I'm not sure why I see 4
            osds here when I have size 3... node1 osd 0 and 1, and
            node3 osd 8 and 22.)
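To pick the slow step out of such a dump without reading it by eye, something like the following sketch works. It uses only the two events quoted above (the elided "..." entries are not reconstructed, so the gap shown is between those two events, not necessarily adjacent ones in the full dump). The `gaps` helper is made up for the example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Only the two events quoted in this message; the elided ones are left out.
events = [
    {"time": "2016-11-15 20:57:48.313767", "event": "waiting for subops from 0,1,8,22"},
    {"time": "2016-11-15 20:57:51.688530", "event": "sub_op_applied_rec from 22"},
]


def gaps(events):
    """Yield (seconds since the previous event, event name) for each event
    after the first one."""
    prev = None
    for entry in events:
        stamp = datetime.strptime(entry["time"], FMT)
        if prev is not None:
            yield (stamp - prev).total_seconds(), entry["event"]
        prev = stamp


seconds, name = max(gaps(events))
print(f"{seconds:.6f}s before '{name}'")
```

For this op, nearly all of the 3.377347 s duration passes between sending the subops and osd 22 reporting back.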

            
            Or some (shorter, I think) have a description like:

            > osd_repop(client.426591.0:203051290 4.1f9 4:9fe4c001:::rbd_data.4cf92238e1f29.00000000000014ef:head v 40047'2531604)

            

      
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com