Re: dealing with the full osd / help reweight

Jacek Jarosiewicz <jjarosiewicz@xxxxxxxxxxxxx> · Tue, 29 Mar 2016 10:32:35 +0200

On 03/25/2016 04:39 AM, Christian Balzer wrote:

Hello,

ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
 0 1.00000  1.00000  5585G  2653G  2931G 47.51 0.85
 1 1.00000  1.00000  5585G  2960G  2624G 53.02 0.94
 2 1.00000  1.00000  5585G  3193G  2391G 57.18 1.02
10 1.00000  1.00000  3723G  2315G  1408G 62.18 1.11
16 1.00000  1.00000  3723G   763G  2959G 20.50 0.36
 3 1.00000  1.00000  5585G  3559G  2025G 63.73 1.13
 4 1.00000  1.00000  5585G  2354G  3230G 42.16 0.75
11 1.00000  1.00000  3723G  1302G  2420G 34.99 0.62
17 0.95000  0.95000  3723G  3388G   334G 91.01 1.62
12 1.00000  1.00000  3723G  2922G   800G 78.50 1.40
 5 1.00000  1.00000  5585G  3972G  1613G 71.12 1.27
 6 1.00000  1.00000  5585G  2975G  2609G 53.28 0.95
 7 1.00000  1.00000  5585G  2208G  3376G 39.54 0.70
13 1.00000  1.00000  3723G  2092G  1631G 56.19 1.00
18 1.00000  1.00000  3723G  3144G   578G 84.45 1.50
 8 1.00000  1.00000  5585G  2909G  2675G 52.10 0.93
 9 1.00000  1.00000  5585G  3089G  2495G 55.31 0.98
14 0.95000        0      0      0      0     0    0 (this osd is full at 97%)
15 1.00000  1.00000  3723G  2629G  1093G 70.63 1.26
19 1.00000  1.00000  3723G  1781G  1941G 47.86 0.85
                TOTAL 89360G 50217G 39143G 56.20
MIN/MAX VAR: 0/1.62  STDDEV: 16.80

And this is where your problem stems from.
How did you deploy this cluster?
Normally the weight is the size of the OSD in TB.
By setting essentially all of them to 1, you're filling up your 4TB drives
long before the 6TB ones.
I assume OSD 14 is also a 4TB one, right?
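
To put a rough number on that, here is a minimal sketch in Python. It
assumes an idealized placement where equal crush weights mean every OSD
receives the same amount of data; the 2500 GB figure is purely
hypothetical, and real placement is far less even with so few PGs.

```python
# Sizes in GB as reported in the "ceph osd df" listing above.
SIZE_6TB_GB = 5585
SIZE_4TB_GB = 3723


def fill_percent(data_gb, size_gb):
    """Percentage of an OSD's capacity taken up by data_gb of data."""
    return 100.0 * data_gb / size_gb


# ASSUMPTION: equal weights => equal data per OSD (idealized).
# The amount itself is a made-up illustration value.
data_per_osd_gb = 2500

small = fill_percent(data_per_osd_gb, SIZE_4TB_GB)
large = fill_percent(data_per_osd_gb, SIZE_6TB_GB)

# The ratio does not depend on the amount of data, only on the sizes.
ratio = small / large

print(f"4TB OSD: {small:.1f}% used, 6TB OSD: {large:.1f}% used")
print(f"the 4TB OSD fills about {ratio:.2f}x as fast")
```

In other words, under this assumption a 4TB OSD is always about 1.5 times
as full as a 6TB one, whatever the amount of data.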

What you want to do, once everything is "stable" as outlined above, is to
very, VERY lightly adjust the crush weights.
Adjusting weights will move data around, sometimes rather randomly and
unexpectedly.
It can (at least temporarily) put even more objects on your already
overloaded OSDs, but if you limit each change to a really small amount
(one or two PGs at a time, hopefully) this shouldn't be too much of an
issue.
Of course you have far more data in each PG than you ought to have, due
to your low PG count.

What you want to do is attract PGs to the bigger OSDs while keeping
the host weights/ratios in mind.
So in your case I would start with a:
---
ceph osd crush reweight osd.0 1.001
---
Which should hopefully result in about one PG being moved to osd.0.
Observe if that's the case, where it came from, etc.
Then repeat this with osd.1 and 2, then 6 and 7, then 4.

Track what's happening and keep doing this with the least utilized 6TB
OSDs until you have the 4TB OSDs at sensible utilization levels.
Again, keep in mind that the host weight (which is the sum of all
OSDs on it) should not deviate too much from the other hosts at this point
in time. Later on it should of course actually reflect reality.
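
A small sketch of how one might keep an eye on the host weights while
nudging individual OSDs. The OSD-to-host grouping below is an assumption
guessed from the row order of the listing above, and the host names are
made up; check the real layout with "ceph osd tree" before relying on it.

```python
# ASSUMPTION: grouping guessed from the row order of "ceph osd df";
# host names are hypothetical. Verify with "ceph osd tree".
HOSTS = {
    "host-a": [0, 1, 2, 10, 16],
    "host-b": [3, 4, 11, 17, 12],
    "host-c": [5, 6, 7, 13, 18],
    "host-d": [8, 9, 14, 15, 19],
}


def host_weights(osd_weights, hosts):
    """Host weight = sum of the crush weights of the OSDs on that host."""
    return {h: sum(osd_weights[o] for o in osds) for h, osds in hosts.items()}


def max_deviation(weights):
    """Largest relative deviation of any host weight from the mean."""
    mean = sum(weights.values()) / len(weights)
    return max(abs(w - mean) / mean for w in weights.values())


# Crush weights from the first listing: all 1.0 except osd.17 and osd.14.
osd_weights = {osd: 1.0 for osds in HOSTS.values() for osd in osds}
osd_weights[17] = 0.95
osd_weights[14] = 0.95

# The first suggested step: ceph osd crush reweight osd.0 1.001
osd_weights[0] = 1.001

hw = host_weights(osd_weights, HOSTS)
for host, weight in hw.items():
    print(f"{host}: {weight:.3f}")
print(f"max relative deviation from mean: {max_deviation(hw):.4f}")
```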

Once you have things where the 6TB OSDs have more or less the same
relative utilization as the 4TB ones you could either leave things (crush
weights) where they are or preferably take the plunge and set things
"correctly".

I'd do it by first setting nobackfill, then setting all the crush
weights to the respective OSD size, for example:
---
ceph osd crush reweight osd.0 5.585
---
Then, after setting all those weights, unset nobackfill and let things
rebalance. If the ratios were close before, this should result in
relatively little data movement.
You probably still want to do this during an off-peak time, of course.
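
The sequence above could be scripted roughly as follows. This is only a
sketch that prints the commands for review instead of running them. It
assumes the OSD sizes from the "ceph osd df" listing (osd.0 to osd.9 at
5585G, osd.10 to osd.19 at 3723G) and derives the weight as size/1000,
simply mirroring the 5.585 example; adjust if you want a different unit.

```python
# Sizes in GB per OSD id, as seen in the "ceph osd df" listing.
SIZES_GB = {osd: (5585 if osd < 10 else 3723) for osd in range(20)}


def crush_weight(size_gb):
    """Weight string as size/1000 with three decimals (5585 -> '5.585').

    ASSUMPTION: this mirrors the example weight of 5.585 above.
    Integer arithmetic avoids any floating-point rounding surprises.
    """
    return f"{size_gb // 1000}.{size_gb % 1000:03d}"


def reweight_commands(sizes_gb):
    """Build the command list: block backfill, set weights, unblock."""
    cmds = ["ceph osd set nobackfill"]
    for osd in sorted(sizes_gb):
        cmds.append(
            f"ceph osd crush reweight osd.{osd} {crush_weight(sizes_gb[osd])}"
        )
    cmds.append("ceph osd unset nobackfill")
    return cmds


cmds = reweight_commands(SIZES_GB)
for cmd in cmds:
    print(cmd)
```

Printing the commands first makes it easy to read through the whole batch
before anything touches the cluster.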

Then you get to think long and hard about increasing your PG count and
change that. Of course you could do that also after your 4TB OSDs are no
longer over-utilized.

Regards,

Christian

The cluster started with half the OSDs and a lot less data.
During testing we hit the 'too many pgs per osd' error and found out
that the number can't be decreased. That's why, when going into
production, we set the initial number of PGs per pool to smaller numbers.
We should have increased the number of PGs earlier, but the amount of
data grew rather quickly and, well... we forgot to increase the
number of PGs in time.

Anyway, over the weekend we managed to get the cluster into a better
state - data is more balanced across the OSDs:

[root@cf04 ~]# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
 0 1.00000  1.00000  5585G  2580G  3005G 46.19 0.87
 1 1.00000  1.00000  5585G  3712G  1872G 66.47 1.25
 2 1.00000  1.00000  5585G  3489G  2095G 62.49 1.17
10 1.00000  1.00000  3723G  2475G  1247G 66.49 1.25
16 1.00000  1.00000  3723G  1773G  1949G 47.64 0.89
 3 1.00000  1.00000  5585G  3651G  1934G 65.37 1.23
 4 1.00000  1.00000  5585G  3085G  2500G 55.24 1.04
11 1.00000  1.00000  3723G  1589G  2133G 42.69 0.80
17 1.00000  0.36897  3723G   912G  2811G 24.50 0.46
12 1.00000  0.29999  3723G  1575G  2148G 42.31 0.79
 5 1.00000  0.78925  5585G  2486G  3098G 44.52 0.84
 6 1.00000  1.00000  5585G  3266G  2319G 58.48 1.10
 7 1.00000  1.00000  5585G  3157G  2427G 56.54 1.06
13 1.00000  1.00000  3723G  2082G  1641G 55.92 1.05
18 1.00000  0.46581  3723G  1750G  1972G 47.01 0.88
 8 1.00000  1.00000  5585G  3079G  2506G 55.13 1.03
 9 1.00000  1.00000  5585G  2816G  2768G 50.42 0.95
14 1.00000  0.29999  3723G  1906G  1816G 51.20 0.96
15 1.00000  0.64502  3723G  1436G  2286G 38.58 0.72
19 1.00000  1.00000  3723G  2791G   932G 74.97 1.41

[root@cf04 ~]# ceph -s
    cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5
     health HEALTH_WARN
            5 pgs backfill
            3 pgs backfilling
            7 pgs degraded
            4 pgs recovery_wait
            7 pgs stuck degraded
            14 pgs stuck unclean
            recovery 9386/97872838 objects degraded (0.010%)
            recovery 8964110/97872838 objects misplaced (9.159%)
            nodeep-scrub flag(s) set
     monmap e1: 3 mons at {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0}
            election epoch 5994, quorum 0,1,2 cf01,cf02,cf03
     osdmap e6626: 20 osds: 20 up, 20 in; 14 remapped pgs
            flags nodeep-scrub
      pgmap v12464669: 304 pgs, 17 pools, 24008 GB data, 45688 kobjects
            49612 GB used, 43471 GB / 93083 GB avail
            9386/97872838 objects degraded (0.010%)
            8964110/97872838 objects misplaced (9.159%)
                 287 active+clean
                   5 active+remapped+wait_backfill
                   4 active+recovery_wait+degraded+remapped
                   3 active+degraded+remapped+backfilling
                   3 active+clean+scrubbing
                   2 active+remapped

I'd like to set the crush weights to the correct values (size in TB) - all
in one move - but I'm afraid it will result in a lot of data movement.

So - assuming all goes well and the cluster is in HEALTH_OK state
within a day or two - what would you recommend doing first? Increasing
the PGs on the pools with the most data (and is it safe to go from a low
number like 64 to 1024 in one step, or should we do this step by step,
by a factor of two)?
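
For reference, the step-by-step variant would look something like the
sketch below. It only lists the intermediate values and the matching
commands; it says nothing about which approach is safer. The pool name
"mypool" is a placeholder, and waiting for the cluster to settle between
steps is left to the operator.

```python
def pg_steps(current, target):
    """Intermediate pg_num values when doubling from current up to target."""
    steps = []
    while current < target:
        current = min(current * 2, target)
        steps.append(current)
    return steps


def pg_commands(pool, steps):
    """For each step, raise pg_num and then pgp_num to the same value."""
    commands = []
    for n in steps:
        commands.append(f"ceph osd pool set {pool} pg_num {n}")
        commands.append(f"ceph osd pool set {pool} pgp_num {n}")
    return commands


# "mypool" is a hypothetical pool name used for illustration only.
steps = pg_steps(64, 1024)
commands = pg_commands("mypool", steps)
for cmd in commands:
    print(cmd)
```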

Or should we first adjust the crush weights and then increase the PGs?
When adjusting crush weights, should we reset the "reweight" to 1.0, or
should it be set to the number of TB per drive as well?

Regards,
J

--
Jacek Jarosiewicz
IT Systems Administrator
