Extending a Ceph cluster with OSDs close to the near-full ratio (85%)

Hi Brian,


On 14 February 2017 at 19:33, Brian Andrus <brian.andrus at dreamhost.com>
wrote:

>
>
> On Tue, Feb 14, 2017 at 5:27 AM, Tyanko Aleksiev <tyanko.alexiev at gmail.com>
> wrote:
>
>> Hi Cephers,
>>
>> At the University of Zurich we are using Ceph as the storage back-end for
>> our OpenStack installation. Since we recently reached 70% occupancy
>> (mostly in the cinder pool, served by 16384 PGs), we are in the process
>> of extending the cluster with additional storage nodes of the same type
>> (except for a slightly more powerful CPU).
>>
>> We decided to opt for a gradual OSD deployment: we created a temporary
>> "root" bucket called "fresh-install" containing the newly installed nodes,
>> and then we moved OSDs from this bucket to the current production root via:
>>
>> ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}
>>
>> Everything seemed nicely planned, but when we started adding a few new
>> OSDs to the cluster, thus triggering a rebalance, one of the existing
>> OSDs, already at 84% disk use, passed the 85% threshold. This in turn
>> triggered the "near full osd(s)" warning, and more than 20 PGs previously
>> in the "wait_backfill" state were marked "wait_backfill+backfill_toofull".
>> Since the OSD kept growing until it reached 90% disk use, we decided to
>> reduce its relative weight from 1 to 0.95.
>> That recalculated the crushmap and remapped a few PGs, but did not appear
>> to move any data off the almost full OSD. Only when, going down in steps
>> of 0.05, we reached a relative weight of 0.50 was some data moved and some
>> "backfill_toofull" requests released. However, we had to go down almost to
>> a relative weight of 0.10 to trigger enough additional data movement for
>> the backfilling to finally finish.
>>
>> We are now adding new OSDs, but the problem is constantly triggered since
>> we have multiple OSDs above 83% that start growing during the rebalance.
>>
>> My questions are:
>>
>> - Is there something wrong with our process of adding new OSDs (some
>> additional details below)?
>>
>>
> It could work, but it could also be more disruptive than it needs to be.
> We have a similar situation/configuration, and what we do is start OSDs
> with `osd crush initial weight = 0` as well as "crush_osd_location" set
> properly. This brings the OSDs up with a crush weight of 0 and lets us
> weight them in in a controlled fashion. We first bring them in at a
> reweight of 1 (no disruption), then increase the crush weight gradually.
>

We are currently trying out this type of gradual insertion. Thanks!
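
For the record, the ceph.conf snippet on the new nodes would look roughly
like the sketch below. This is only our reading of what Brian describes:
the exact spelling of the location option (he calls it "crush_osd_location")
and the bucket values are assumptions, not something we have verified on
0.94.9:

[osd]
# Bring new OSDs up with zero crush weight so their creation moves no data;
# the crush weight is then ramped up manually in small steps.
osd crush initial weight = 0

# Pin the crush location so the OSDs register under the intended host and
# root bucket (assumed name of the option Brian refers to).
osd crush location = root=sas host=osd-k7-41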


>
>
>> - We also noticed that the problem tends to cluster around the newly
>> added OSDs, so could those two things be correlated?
>>
> I'm not sure which problem you are referring to - the OSDs filling up?
> Possibly due to temporary files or some other mechanism I'm not familiar
> with adding a little extra data on top.
>
>> - Why does reweighting not trigger instant data movement? What's the logic
>> behind remapped PGs? Is there some sort of flat queue of tasks, or are
>> there priorities defined?
>>
>>
> It should; perhaps you aren't choosing large enough increments, or perhaps
> you have some settings in place that are limiting it.
>

Indeed, with sufficiently large increments it does trigger some immediate PG
rebalancing.
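
For reference, a step that did trigger immediate movement looked roughly
like the following; the OSD id and the size of the decrement here are
illustrative only:

# lower the override (relative) weight of the nearly full OSD in one
# sufficiently large step
ceph osd reweight 281 0.85

# then watch the remapped/backfilling PGs drain off it
ceph -s
ceph health detail | grep -Ei 'backfill|near full'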


>
>
>> - Has anybody experienced this situation, and if so, how was it solved/bypassed?
>>
>>
> FWIW, we also run a rebalance cronjob every hour with the following:
>
> `ceph osd reweight-by-utilization 103 .010 10`
>

Already running that but on a daily basis.
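
In our case it is just a root cron entry along these lines; the schedule and
log path are our own choices, the command itself is the one Brian quotes:

# /etc/cron.d/ceph-rebalance -- cap the fullest OSDs once a day
0 3 * * * root ceph osd reweight-by-utilization 103 .010 10 >> /var/log/ceph-rebalance.log 2>&1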


>
> It was detailed in another recent thread on [ceph-users].
>
>
>> Cluster details are as follows:
>>
>> - version: 0.94.9
>> - 5 monitors,
>> - 40 storage hosts, each with 24 x 4 TB disks: 1 OSD/disk (960 OSDs in total),
>> - osd pool default size = 3,
>> - journaling is on SSDs.
>>
>> We have "hosts" failure domain. Relevant crushmap details:
>>
>> # rules
>> rule sas {
>>         ruleset 1
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take sas
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>>
>> root sas {
>>         id -41          # do not change unnecessarily
>>         # weight 3283.279
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd-l2-16 weight 87.360
>>         item osd-l4-06 weight 87.360
>>         ...
>>         item osd-k7-41 weight 14.560
>>         item osd-l4-36 weight 14.560
>>         item osd-k5-36 weight 14.560
>> }
>>
>> host osd-k7-21 {
>>         id -46          # do not change unnecessarily
>>         # weight 87.360
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.281 weight 3.640
>>         item osd.282 weight 3.640
>>         item osd.285 weight 3.640
>>         ...
>> }
>>
>> host osd-k7-41 {
>>         id -50          # do not change unnecessarily
>>         # weight 14.560
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.900 weight 3.640
>>         item osd.901 weight 3.640
>>         item osd.902 weight 3.640
>>         item osd.903 weight 3.640
>> }
>>
>>
>> As mentioned before we created a temporary bucket called "fresh-install"
>> containing the newly installed nodes (i.e.):
>>
>> root fresh-install {
>>         id -34          # do not change unnecessarily
>>         # weight 218.400
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd-k5-36-fresh weight 72.800
>>         item osd-k7-41-fresh weight 72.800
>>         item osd-l4-36-fresh weight 72.800
>> }
>>
>> Then, by steps of 6 OSDs (2 OSDs from each new host), we move OSDs from
>> the "fresh-install" to the "sas" bucket.
>>
>>
> I would highly recommend a simple script to weight in gradually as
> described above. Much more controllable and you can twiddle the knobs to
> your heart's desire.
>
>>
>> Thank you in advance for all the suggestions.
>>
>> Cheers,
>> Tyanko
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> Hope that helps.
>

Thanks for the suggestions.
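
For the archives, the kind of gradual weighting script Brian recommends would
look roughly like the sketch below. The OSD list, the 3.640 target (our 4 TB
disks), the 0.2 step and the 60-second poll interval are all placeholders to
tune, and the health check is deliberately crude:

#!/bin/bash
# Ramp the crush weight of freshly added OSDs up in small steps, waiting
# for the backfill triggered by each step to settle before the next one.
OSDS="osd.900 osd.901 osd.902"
TARGET=3.640
STEP=0.2

for osd in $OSDS; do
    weight=0
    while awk -v w="$weight" -v t="$TARGET" 'BEGIN {exit !(w < t)}'; do
        # add one step, capped at the target weight
        weight=$(awk -v w="$weight" -v s="$STEP" -v t="$TARGET" \
                 'BEGIN {w += s; if (w > t) w = t; printf "%.3f", w}')
        ceph osd crush reweight "$osd" "$weight"
        # crude settle check: loop until health no longer mentions
        # backfilling or recovering PGs
        while ceph health | grep -Eqi 'backfill|recover'; do
            sleep 60
        done
    done
done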

Cheers,
Tyanko


>
> --
> Brian Andrus | Cloud Systems Engineer | DreamHost
> brian.andrus at DreamHost.com | www.dreamhost.com
>

