Re: ceph cluster having blocked requests very frequently

On 11/15/16 22:13, Thomas Danan wrote:
> Very interesting ...
>
> Any idea why optimal tunables would help here?
I think there are some versions where it rebalances a lot of data to even things out... I don't remember where I read that or why I think so. Maybe it was only argonaut vs. newer. But having to rebalance 75% of the data makes me feel more confident. (Keep in mind it significantly changes client version compatibility requirements, especially for kernel drivers, where a compatible version possibly doesn't even exist.)

And looking at iostat, etc., at the times when requests block, it seems like 1-2 disks are at 100% util, the rest are nearly idle, and the SSD journals rarely go above 10% or so (I bought 2 expensive Micron DC ones per node). So I think balance is the most important thing I need, and plain efficiency is the next thing (which might come from bluestore when it's ready, especially for rbd snapshot CoW). Having 2 disks at 100% is about 300-560 iops, where the whole server ought to do about 1700 iops (3 disks that do about 280 and 6 more that do about 150 direct sync rand write 4k iops per node). That's about 21% utilization before it blocks.
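The per-node numbers above can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch using the figures quoted in the paragraph (3 disks at about 280 iops, 6 at about 150); the variable and function names are made up for illustration.

```python
# Rough per-node IOPS budget, using the figures quoted above.
FAST_DISKS, FAST_IOPS = 3, 280   # disks doing ~280 sync 4k randwrite iops
SLOW_DISKS, SLOW_IOPS = 6, 150   # disks doing ~150

# What the node could do if load were spread evenly over all disks.
node_total = FAST_DISKS * FAST_IOPS + SLOW_DISKS * SLOW_IOPS

# If only two disks are saturated, the bound depends on which two.
two_busy_low = 2 * SLOW_IOPS
two_busy_high = 2 * FAST_IOPS


def share(iops, total):
    """Percentage of the node's aggregate capability, one decimal place."""
    return round(100.0 * iops / total, 1)


print(node_total)                   # 1740
print(two_busy_low, two_busy_high)  # 300 560
print(share(two_busy_low, node_total), share(two_busy_high, node_total))
```

With these inputs, two saturated disks account for somewhere between roughly 17% and 32% of what the node could do in aggregate, which is the same ballpark as the estimate above.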

You could try getting the data out of my ganglia here (sda,sdb are the SSDs, and ceph2 sdg is broken and missing with bogus data on the graphs):
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=ceph.*&mreg%5B%5D=sd%5Bc-z%5D_util&gtype=line&glegend=show&aggregate=1

But it's not that easy to get this info out of ganglia... highly customized graphing isn't the best.

> On our cluster we have 500TB of data, so I am a bit concerned about changing it without taking a lot of precautions...
I can't guarantee a bug-free experience, but you can change it, look at the percentage of objects to be rebalanced, and if you don't like it, change it back (maybe it will be much less going from firefly to hammer than to jewel like me). But if you wait an hour before changing it back, you can bet it takes an hour to settle again (or maybe set nobackfill). I don't like this, but I don't know what to do other than rate limit it and accept the enormous wait.
> I am curious to know how long it takes you to change tunables, the size of your cluster, and the observed impact on client IO...
Well... it was at 77.65% or so (tunables made it 75%, plus more pgs), and now after almost 3 hours, it's at 75.141%... so I imagine it'll take somewhere between 75 hours and forever minus a day or two. But with the sleep settings, it seems not to cause any issues. So if there's any chance of it balancing out the load on the osds, I'll try it. (And these numbers are with me fiddling with it and watching it every now and then... I'll set max backfills back to 1 and sleep back to about 0.6 when I go to bed... maybe then it'll be half speed.)
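For a rough idea of where an estimate like that comes from, here is a minimal sketch that linearly extrapolates the two percentages quoted above. It assumes exactly 3 hours elapsed and a constant rate, neither of which is really true here (the settings were being changed along the way); `hours_remaining` is a made-up helper name.

```python
def hours_remaining(start_pct, now_pct, elapsed_hours):
    """Linear extrapolation: hours until the misplaced % reaches zero."""
    rate = (start_pct - now_pct) / elapsed_hours  # percentage points per hour
    if rate <= 0:
        return float("inf")  # no progress observed, so no finite estimate
    return now_pct / rate


# 77.65% at the start, 75.141% after about 3 hours.
eta = hours_remaining(77.65, 75.141, 3.0)
print(round(eta, 1))
```

At the observed rate this comes out at roughly 90 hours, and halving the backfill speed overnight would stretch it accordingly.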

Also FYI I only have 31% space used (most of the disks I added were to make it not horribly slow rather than add space, since it was so slow with only 3 disks per OSD).

The cluster is just 3 nodes, with 2x Micron S630DC-400, 3x HUS724040ALS640, and 6 x Hitachi HUA722020ALA330 (minus one dead one) (last one is SATA... just some old stuff I added to speed things up, which helped even though they're slower).

# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    65173G     45110G       20063G         30.78
And as for impact... I could tell you more tomorrow. But with the sleep settings, the 4k randwrite iops in fio benchmarks seem to be maybe half of, or the same as, before, and other behavior doesn't seem so bad... maybe even better than before on average, with a few more hiccups but less of the blocking that kills qemu VMs (which I can't explain... do tunables do that right away? Or did the snap trim sleep do something? I doubt the recovery one did, since there was no recovery until I decided to change things. Or it's just luck so far, and tomorrow morning some VMs will be dead and need SIGKILL, like every morning for the past week).

> Yes, we do have daily rbd snapshots from 16 different ceph RBD clients. Snapshotting the RBD image is almost immediate, while we are seeing the issue continuously during the day...
>
> Will check all of this tomorrow...
>
> Thanks again
>
> Thomas





-------- Original message --------
From: Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx>
Date: 11/15/16 21:27 (GMT+01:00)
To: Thomas Danan <Thomas.Danan@xxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: ceph cluster having blocked requests very frequently

On 11/15/16 14:05, Thomas Danan wrote:
> Hi Peter,
>
> Ceph cluster version is 0.94.5 and we are running with Firefly tunables, and we also have 10K PGs instead of the 30K / 40K we should have.
> The Linux kernel version is 3.10.0-327.36.1.el7.x86_64 with RHEL 7.2
>
> On our side we have the following settings:
> mon_osd_adjust_heartbeat_grace = false
> mon_osd_adjust_down_out_interval = false
> mon_osd_min_down_reporters = 5
> mon_osd_min_down_reports = 10
>
> which explains why the OSDs are not flapping, but they are still misbehaving and generating the slow requests I am describing.
>
> The osd_op_complaint_time is at its default value (30 sec); I am not sure I want to change it based on your experience.
I wasn't saying you should set the complaint time to 5, just saying
that's why I have complaints logged with such low block times.
> Thomas

And now I'm testing this:
        osd recovery sleep = 0.5
        osd snap trim sleep = 0.5

(or fiddling with it as low as 0.1 to make it rebalance faster)

While also changing tunables to optimal (which will rebalance 75% of the
objects)
Which has very good results so far (a few <14s blocks right at the
start, and none since, over an hour ago).

And I'm somehow hoping that will fix my rbd export-diff issue too... but
it at least appears to fix the rebalance causing blocks.

Do you use rbd snapshots? I think that may be causing my issues, based
on things like:

>             "description": "osd_op(client.692201.0:20455419 4.1b5a5bc1
> rbd_data.94a08238e1f29.000000000000617b [] snapc 918d=[918d]
> ack+ondisk+write+known_if_redirected e40036)",
>             "initiated_at": "2016-11-15 20:57:48.313432",
>             "age": 409.634862,
>             "duration": 3.377347,
>             ...
>                     {
>                         "time": "2016-11-15 20:57:48.313767",
>                         "event": "waiting for subops from 0,1,8,22"
>                     },
>             ...
>                     {
>                         "time": "2016-11-15 20:57:51.688530",
>                         "event": "sub_op_applied_rec from 22"
>                     },


Which says "snapc" in there (CoW?), and I think shows that just one osd
is delayed a few seconds and the rest are really fast, like you said.
(and not sure why I see 4 osds here when I have size 3... node1 osd 0
and 1, and node3 osd 8 and 22)
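To spot the slow step in an op like this without reading timestamps by eye, something like the following could help. It is only a sketch: it assumes you have already pulled the list of {"time", "event"} entries out of the op dump (where that list sits in the JSON varies between versions, so that part is left out), and `event_gaps` is a made-up helper. Only the two events visible in the excerpt are used; the elided ones are not reconstructed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # timestamp format seen in the dump above


def event_gaps(events):
    """Return (seconds_since_previous_event, event_name) pairs."""
    gaps = []
    prev = None
    for ev in events:
        t = datetime.strptime(ev["time"], FMT)
        if prev is not None:
            gaps.append(((t - prev).total_seconds(), ev["event"]))
        prev = t
    return gaps


# The two events shown in the excerpt; anything in between was elided.
events = [
    {"time": "2016-11-15 20:57:48.313767",
     "event": "waiting for subops from 0,1,8,22"},
    {"time": "2016-11-15 20:57:51.688530",
     "event": "sub_op_applied_rec from 22"},
]

# The largest gap points at the step the op spent the most time waiting on.
slowest = max(event_gaps(events))
print(slowest)
```

Run over the full event list of a blocked op, the largest gap should point at the osd that held things up.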

or some (shorter I think) have description like:
> osd_repop(client.426591.0:203051290 4.1f9
> 4:9fe4c001:::rbd_data.4cf92238e1f29.00000000000014ef:head v 40047'2531604)




