On 03/25/2016 04:39 AM, Christian Balzer wrote:
Hello,
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
0 1.00000 1.00000 5585G 2653G 2931G 47.51 0.85
1 1.00000 1.00000 5585G 2960G 2624G 53.02 0.94
2 1.00000 1.00000 5585G 3193G 2391G 57.18 1.02
10 1.00000 1.00000 3723G 2315G 1408G 62.18 1.11
16 1.00000 1.00000 3723G 763G 2959G 20.50 0.36
3 1.00000 1.00000 5585G 3559G 2025G 63.73 1.13
4 1.00000 1.00000 5585G 2354G 3230G 42.16 0.75
11 1.00000 1.00000 3723G 1302G 2420G 34.99 0.62
17 0.95000 0.95000 3723G 3388G 334G 91.01 1.62
12 1.00000 1.00000 3723G 2922G 800G 78.50 1.40
5 1.00000 1.00000 5585G 3972G 1613G 71.12 1.27
6 1.00000 1.00000 5585G 2975G 2609G 53.28 0.95
7 1.00000 1.00000 5585G 2208G 3376G 39.54 0.70
13 1.00000 1.00000 3723G 2092G 1631G 56.19 1.00
18 1.00000 1.00000 3723G 3144G 578G 84.45 1.50
8 1.00000 1.00000 5585G 2909G 2675G 52.10 0.93
9 1.00000 1.00000 5585G 3089G 2495G 55.31 0.98
14 0.95000 0 0 0 0 0 0 (this osd is full at 97%)
15 1.00000 1.00000 3723G 2629G 1093G 70.63 1.26
19 1.00000 1.00000 3723G 1781G 1941G 47.86 0.85
TOTAL 89360G 50217G 39143G 56.20
MIN/MAX VAR: 0/1.62 STDDEV: 16.80
And this is where your problem stems from.
How did you deploy this cluster?
Normally the weight is the size of the OSD in TB.
By setting them all to 1, you're essentially filling up your 4TB drives
long before the 6TB ones.
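To put a rough number on it: with equal weights CRUSH assigns each OSD
about the same amount of data, and 5585G / 3723G ~= 1.5, so a 4TB OSD
will sit at roughly 1.5x the utilization of a 6TB one.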
I assume OSD 14 is also a 4TB one, right?
What you want to do, once everything is "stable" as outlined above, is to
very, VERY lightly adjust crush weights.
Adjusting things will move data around, sometimes rather randomly and
unexpectedly.
It can (at least temporarily) put even more objects on your already
overloaded OSDs, but if you limit each change to a really small amount
(hopefully one or two PGs at a time) this shouldn't be too much of an
issue.
Of course you have far more data in your PGs than you ought to have, due
to your low PG count.
What you want to do is to attract PGs to the bigger OSDs while also
keeping the host weights/ratios in mind.
So in your case I would start with a:
---
ceph osd crush reweight osd.0 1.001
---
Which should hopefully result in about one PG being moved to osd.0.
Observe if that's the case, where it came from, etc.
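One way to track exactly which PGs moved (just a sketch, comparing the
PG-to-OSD mappings before and after; the /tmp paths are only
illustrative):
---
ceph pg dump pgs_brief > /tmp/pgs.before
# ...do the reweight, wait for things to settle...
ceph pg dump pgs_brief > /tmp/pgs.after
diff /tmp/pgs.before /tmp/pgs.after
---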
Then repeat this with osd.1 and 2, then 6 and 7, then 4.
Track what's happening and keep doing this with the least utilized 6TB
OSDs until you have the 4TB OSDs at sensible utilization levels.
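Each round would then look something like this (a sketch; wait for all
PGs to go back to active+clean before the next bump):
---
ceph osd crush reweight osd.1 1.001
ceph -s      # wait for recovery/backfill to finish
ceph osd df  # re-check utilization before touching the next OSD
---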
Again, keep in mind that the host weight (which is the sum of all
OSDs on it) should not deviate too much from the other hosts at this point
in time. Later on it should of course actually reflect reality.
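The per-host sums are easy to check at any point with:
---
ceph osd tree
---
(the weight shown for each host is just the sum of its OSDs' crush
weights)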
Once you have things where the 6TB OSDs have more or less the same
relative utilization as the 4TB ones, you could either leave things (crush
weights) where they are or, preferably, take the plunge and set things
"correctly".
I'd do it by first setting nobackfill, then setting all the crush
weights to the respective OSD size, for example:
---
ceph osd crush reweight osd.0 5.585
---
Then after setting all those weights, unset nobackfill and let things
rebalance; if the ratios were close before, this should result in
relatively little data movement.
You probably still want to do this during an off-peak time, of course.
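Put together, that step might look like this (a sketch; repeat the
reweight line for every OSD with its actual size):
---
ceph osd set nobackfill
ceph osd crush reweight osd.0 5.585     # 6TB OSDs
ceph osd crush reweight osd.10 3.723    # 4TB OSDs
# ...and so on for the rest...
ceph osd unset nobackfill
---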
Then you get to think long and hard about increasing your PG count and
changing that. Of course you could also do that after your 4TB OSDs are
no longer over-utilized.
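For reference, raising the PG count is done per pool, and pgp_num needs
to follow pg_num before any data actually moves, along the lines of:
---
ceph osd pool set <poolname> pg_num 128
ceph osd pool set <poolname> pgp_num 128
---
(128 is just a placeholder here; pick the value appropriate for the
pool)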
Regards,
Christian
The cluster started with half the osds and a lot less data.
During testing we hit the 'too many pgs per osd' error and found out
that the number can't be decreased. That's why, when going into
production, we set the initial number of pgs per pool to smaller numbers.
We should have increased the number of pgs earlier, but the amount of
data grew somewhat quickly and, well... we forgot to increase the
number of pgs in time.
Anyway, over the weekend we managed to get the cluster into a better
state - data is more balanced across the osds:
[root@cf04 ~]# ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
0 1.00000 1.00000 5585G 2580G 3005G 46.19 0.87
1 1.00000 1.00000 5585G 3712G 1872G 66.47 1.25
2 1.00000 1.00000 5585G 3489G 2095G 62.49 1.17
10 1.00000 1.00000 3723G 2475G 1247G 66.49 1.25
16 1.00000 1.00000 3723G 1773G 1949G 47.64 0.89
3 1.00000 1.00000 5585G 3651G 1934G 65.37 1.23
4 1.00000 1.00000 5585G 3085G 2500G 55.24 1.04
11 1.00000 1.00000 3723G 1589G 2133G 42.69 0.80
17 1.00000 0.36897 3723G 912G 2811G 24.50 0.46
12 1.00000 0.29999 3723G 1575G 2148G 42.31 0.79
5 1.00000 0.78925 5585G 2486G 3098G 44.52 0.84
6 1.00000 1.00000 5585G 3266G 2319G 58.48 1.10
7 1.00000 1.00000 5585G 3157G 2427G 56.54 1.06
13 1.00000 1.00000 3723G 2082G 1641G 55.92 1.05
18 1.00000 0.46581 3723G 1750G 1972G 47.01 0.88
8 1.00000 1.00000 5585G 3079G 2506G 55.13 1.03
9 1.00000 1.00000 5585G 2816G 2768G 50.42 0.95
14 1.00000 0.29999 3723G 1906G 1816G 51.20 0.96
15 1.00000 0.64502 3723G 1436G 2286G 38.58 0.72
19 1.00000 1.00000 3723G 2791G 932G 74.97 1.41
[root@cf04 ~]# ceph -s
cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5
health HEALTH_WARN
5 pgs backfill
3 pgs backfilling
7 pgs degraded
4 pgs recovery_wait
7 pgs stuck degraded
14 pgs stuck unclean
recovery 9386/97872838 objects degraded (0.010%)
recovery 8964110/97872838 objects misplaced (9.159%)
nodeep-scrub flag(s) set
monmap e1: 3 mons at
{cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0}
election epoch 5994, quorum 0,1,2 cf01,cf02,cf03
osdmap e6626: 20 osds: 20 up, 20 in; 14 remapped pgs
flags nodeep-scrub
pgmap v12464669: 304 pgs, 17 pools, 24008 GB data, 45688 kobjects
49612 GB used, 43471 GB / 93083 GB avail
9386/97872838 objects degraded (0.010%)
8964110/97872838 objects misplaced (9.159%)
287 active+clean
5 active+remapped+wait_backfill
4 active+recovery_wait+degraded+remapped
3 active+degraded+remapped+backfilling
3 active+clean+scrubbing
2 active+remapped
I'd like to set the crush weights to the correct values (size in TB) - all
in one move - but I'm afraid it will result in a lot of data movement.
So - assuming all goes well and the cluster reaches HEALTH_OK within a
day or two - what would you recommend doing first: increasing the pgs on
the pools with the most data (and is it safe to go from a low number
like 64 to 1024 in one step, or should we do it step by step, by a
factor of two)?
Or should we first adjust the crush weights and then increase the pgs?
When adjusting the crush weights, should we reset the "reweight" to 1.0,
or should it be set to the number of TBs per drive as well?
Regards,
J
--
Jacek Jarosiewicz
IT Systems Administrator
----------------------------------------------------------------------------------------
SUPERMEDIA Sp. z o.o., with its registered office in Warsaw
ul. Senatorska 13/15, 00-075 Warszawa
District Court for the Capital City of Warsaw, XII Commercial Division
of the National Court Register,
KRS no. 0000029537; share capital PLN 44,556,000.00
NIP: 957-05-49-503
Mailing address: ul. Jubilerska 10, 04-190 Warszawa
----------------------------------------------------------------------------------------
SUPERMEDIA -> http://www.supermedia.pl
internet access - hosting - colocation - connectivity - telephony
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com