Re: dealing with the full osd / help reweight

Hello,

On Tue, 29 Mar 2016 10:32:35 +0200 Jacek Jarosiewicz wrote:

> On 03/25/2016 04:39 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> >>
> >> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
> >>    0 1.00000  1.00000  5585G  2653G  2931G 47.51 0.85
> >>    1 1.00000  1.00000  5585G  2960G  2624G 53.02 0.94
> >>    2 1.00000  1.00000  5585G  3193G  2391G 57.18 1.02
> >> 10 1.00000  1.00000  3723G  2315G  1408G 62.18 1.11
> >> 16 1.00000  1.00000  3723G   763G  2959G 20.50 0.36
> >>    3 1.00000  1.00000  5585G  3559G  2025G 63.73 1.13
> >>    4 1.00000  1.00000  5585G  2354G  3230G 42.16 0.75
> >> 11 1.00000  1.00000  3723G  1302G  2420G 34.99 0.62
> >> 17 0.95000  0.95000  3723G  3388G   334G 91.01 1.62
> >> 12 1.00000  1.00000  3723G  2922G   800G 78.50 1.40
> >>    5 1.00000  1.00000  5585G  3972G  1613G 71.12 1.27
> >>    6 1.00000  1.00000  5585G  2975G  2609G 53.28 0.95
> >>    7 1.00000  1.00000  5585G  2208G  3376G 39.54 0.70
> >> 13 1.00000  1.00000  3723G  2092G  1631G 56.19 1.00
> >> 18 1.00000  1.00000  3723G  3144G   578G 84.45 1.50
> >>    8 1.00000  1.00000  5585G  2909G  2675G 52.10 0.93
> >>    9 1.00000  1.00000  5585G  3089G  2495G 55.31 0.98
> >> 14 0.95000        0      0      0      0     0    0 (this osd is full
> >> at 97%)
> >> 15 1.00000  1.00000  3723G  2629G  1093G 70.63 1.26
> >> 19 1.00000  1.00000  3723G  1781G  1941G 47.86 0.85
> >>                 TOTAL 89360G 50217G 39143G 56.20
> >> MIN/MAX VAR: 0/1.62  STDDEV: 16.80
> >>
> > And this is where your problem stems from.
> > How did you deploy this cluster?
> > Normally an OSD's weight is its size in TB.
> > By setting them all to 1, you're essentially filling up your 4TB drives
> > long before the 6TB ones.
> > I assume OSD 14 is a 4TB one as well, right?
> >
> > What you want to do, once everything is "stable" as outlined above,
> > is to very, VERY lightly adjust crush weights.
> > Adjusting things will move data around, sometimes rather randomly and
> > unexpectedly.
> > It can (at least temporarily) put even more objects on your already
> > overloaded OSDs, but if you limit each change to a really small amount
> > (hopefully one or two PGs at a time) this shouldn't be too much of an
> > issue.
> > Of course you have far more data in your PGs than you ought to have,
> > due to your low PG count.
> >
> > What you want to do is attract PGs to the bigger OSDs while also
> > keeping the host weights/ratios in mind.
> > So in your case I would start with a:
> > ---
> > ceph osd crush reweight osd.0 1.001
> > ---
> > Which should hopefully result in about one PG being moved to osd.0.
> > Observe if that's the case, where it came from, etc.
> > Then repeat this with osd.1 and 2, then 6 and 7, then 4.
> >
> > Track what's happening and keep doing this with the least utilized 6TB
> > OSDs until you have the 4TB OSDs at sensible utilization levels.
> > Again, keep in mind that the host weight (which is the sum of all
> > OSDs on it) should not deviate too much from the other hosts at this
> > point in time. Later on it should of course actually reflect reality.
> >
> > Once you have things where the 6TB OSDs have more or less the same
> > relative utilization as the 4TB ones you could either leave things
> > (crush weights) where they are or preferably take the plunge and set
> > things "correctly".
> >
> > I'd do it by first setting nobackfill, then go and set the all the
> > crush weights to the respective OSD size, for example:
> > ---
> > ceph osd crush reweight osd.0 5.585
> > ---
> > Then, after setting all those weights, unset nobackfill and let things
> > rebalance; if the ratios were close before, this should result in
> > relatively little data movement.
> > You probably still want to do this during off-peak hours, of course.
> >
> > Then you get to think long and hard about increasing your PG count and
> > change that. Of course you could do that also after your 4TB OSDs are
> > no longer over-utilized.
> >
> > Regards,
> >
> > Christian
> >
> 
> The cluster started with half the OSDs and a lot less data.
> During testing we hit the 'too many PGs per OSD' error and found out
> that the number can't be decreased. That's why, when going into
> production, we set the initial number of PGs per pool to smaller numbers.
> We should have increased the number of PGs earlier, but the amount of
> data grew rather quickly and, well... we forgot to increase the
> number of PGs in time.
> 
> Anyway, over the weekend we've managed to get the cluster into a better 
> state - data is more balanced over the OSDs:
> 
> [root@cf04 ~]# ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>   0 1.00000  1.00000  5585G  2580G  3005G 46.19 0.87
>   1 1.00000  1.00000  5585G  3712G  1872G 66.47 1.25
>   2 1.00000  1.00000  5585G  3489G  2095G 62.49 1.17
> 10 1.00000  1.00000  3723G  2475G  1247G 66.49 1.25
> 16 1.00000  1.00000  3723G  1773G  1949G 47.64 0.89
>   3 1.00000  1.00000  5585G  3651G  1934G 65.37 1.23
>   4 1.00000  1.00000  5585G  3085G  2500G 55.24 1.04
> 11 1.00000  1.00000  3723G  1589G  2133G 42.69 0.80
> 17 1.00000  0.36897  3723G   912G  2811G 24.50 0.46
> 12 1.00000  0.29999  3723G  1575G  2148G 42.31 0.79
>   5 1.00000  0.78925  5585G  2486G  3098G 44.52 0.84
>   6 1.00000  1.00000  5585G  3266G  2319G 58.48 1.10
>   7 1.00000  1.00000  5585G  3157G  2427G 56.54 1.06
> 13 1.00000  1.00000  3723G  2082G  1641G 55.92 1.05
> 18 1.00000  0.46581  3723G  1750G  1972G 47.01 0.88
>   8 1.00000  1.00000  5585G  3079G  2506G 55.13 1.03
>   9 1.00000  1.00000  5585G  2816G  2768G 50.42 0.95
> 14 1.00000  0.29999  3723G  1906G  1816G 51.20 0.96
> 15 1.00000  0.64502  3723G  1436G  2286G 38.58 0.72
> 19 1.00000  1.00000  3723G  2791G   932G 74.97 1.41
>

I very specifically and intentionally wrote "ceph osd crush reweight" in
my reply above.
While your current state of affairs is better, it is not permanent ("ceph
osd reweight" settings are lost if an OSD is set out), and what I outlined
should have left you with nearly perfect CRUSH weight ratios.

Oh well, since you're already far down that path, continue until the
respective ratios (aka %USE in the output above) are as close to each
other as possible.
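As a hedged sketch of those small steps, the loop below only *prints* the
nudge commands (in the order I suggested earlier: 0, 1, 2, then 6, 7, then 4)
so they can be reviewed and applied one at a time; the 1.001 value is the
tiny increment from my earlier example, not a magic number:

```shell
# Print (not run) the small CRUSH-weight nudges for the 6TB OSDs.
# Apply one, watch where data moves, then continue to the next.
for id in 0 1 2 6 7 4; do
  echo "ceph osd crush reweight osd.$id 1.001"
done
# After each applied step, observe the movement with:
#   ceph osd df
#   ceph -s
```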
 
> [root@cf04 ~]# ceph -s
>      cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5
>       health HEALTH_WARN
>              5 pgs backfill
>              3 pgs backfilling
>              7 pgs degraded
>              4 pgs recovery_wait
>              7 pgs stuck degraded
>              14 pgs stuck unclean
>              recovery 9386/97872838 objects degraded (0.010%)
>              recovery 8964110/97872838 objects misplaced (9.159%)
>              nodeep-scrub flag(s) set
>       monmap e1: 3 mons at 
> {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0}
>              election epoch 5994, quorum 0,1,2 cf01,cf02,cf03
>       osdmap e6626: 20 osds: 20 up, 20 in; 14 remapped pgs
>              flags nodeep-scrub
>        pgmap v12464669: 304 pgs, 17 pools, 24008 GB data, 45688 kobjects
>              49612 GB used, 43471 GB / 93083 GB avail
>              9386/97872838 objects degraded (0.010%)
>              8964110/97872838 objects misplaced (9.159%)
>                   287 active+clean
>                     5 active+remapped+wait_backfill
>                     4 active+recovery_wait+degraded+remapped
>                     3 active+degraded+remapped+backfilling
>                     3 active+clean+scrubbing
>                     2 active+remapped
> 
You might want to disable regular scrubbing as well for the duration.
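That is, something along these lines (you already have nodeep-scrub set per
the status output above; unset both flags again once the rebalance has
settled):

```shell
ceph osd set noscrub         # pause regular scrubs during the rebalance
# later, once the cluster is back to HEALTH_OK:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```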

> 
> I'd like to set the crush weights to correct values (size in TB) - all 
> in one move - but I'm afraid it will result in a lot of data movement.
>
If your ratios are correct at that time, it will be very little; the heavy
lifting is mostly what you're doing now.
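As a sketch of that final step, the script below prints the command sequence
for review rather than running it. The osd:size-in-GiB pairs come from your
`ceph osd df` output, and the GiB/1000 convention matches my earlier
5585G -> 5.585 example; double-check both against your actual drives:

```shell
# Print the nobackfill + final-CRUSH-weight sequence for review.
# Extend the osd:size list to cover all 20 OSDs before using it.
echo "ceph osd set nobackfill"
for entry in 0:5585 1:5585 2:5585 3:5585 10:3723 16:3723; do
  id=${entry%:*}                 # OSD id before the colon
  gib=${entry#*:}                # size in GiB after the colon
  w=$(awk -v g="$gib" 'BEGIN { printf "%.3f", g / 1000 }')
  echo "ceph osd crush reweight osd.$id $w"
done
echo "ceph osd unset nobackfill"
```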
 
> So - assuming all goes well and the cluster is in HEALTH_OK state 
> within a day or two - what would you recommend doing first: increasing 
> the PGs on the pools with the most data (and is it safe to go from a low 
> number like 64 to 1024 in one step, or should we do it step by step, 
> by a factor of two)?
> 
Recent versions of Ceph won't allow you to do large increases anyway
(doubling at most, I think), so obviously the latter.
And yes, this will cause MASSIVE data movement, but it will also reduce
the amount of data moving around (smaller PGs) in the final step.
I would do this first.
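A sketch of that stepwise growth, again just printing the doubling sequence
from 64 up to 1024 (the pool name "rbd" is only a placeholder; substitute
your data-heavy pools, and set pg_num before pgp_num at each step):

```shell
# Print pg_num/pgp_num doubling steps 64 -> 1024 for a placeholder pool.
# Apply one step at a time and let 'ceph -s' settle in between.
pg=64
while [ "$pg" -lt 1024 ]; do
  pg=$((pg * 2))
  echo "ceph osd pool set rbd pg_num $pg"
  echo "ceph osd pool set rbd pgp_num $pg"
done
```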

> Or should we first adjust crush weights and then increase pgs?
> When adjusting crush weights should we reset the "reweight" to 1.0 or 
> should it be set to the number of TBs per drive as well?
> 
"Subcommand reweight reweights osd to 0.0 < <weight> < 1.0."
As I said, you shouldn't have used that; it's a temporary crutch at best.

So as I wrote originally: set nobackfill, adjust all crush weights, set
all osd reweights to 1, unset nobackfill and enjoy the show.
Which will be a tiny show if your ratios were close to equal before
that; see above.
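For the reweight part of that, a sketch that prints the restore commands
for review; the OSD ids below are the ones showing a REWEIGHT other than 1
in your df output above:

```shell
# Print the commands that restore the temporary reweight values to 1.
for id in 5 12 14 15 17 18; do
  echo "ceph osd reweight osd.$id 1"
done
```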

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


