Hello,

On Tue, 29 Mar 2016 10:32:35 +0200 Jacek Jarosiewicz wrote:

> On 03/25/2016 04:39 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> >>
> >> ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR
> >>  0 1.00000 1.00000  5585G 2653G 2931G 47.51 0.85
> >>  1 1.00000 1.00000  5585G 2960G 2624G 53.02 0.94
> >>  2 1.00000 1.00000  5585G 3193G 2391G 57.18 1.02
> >> 10 1.00000 1.00000  3723G 2315G 1408G 62.18 1.11
> >> 16 1.00000 1.00000  3723G  763G 2959G 20.50 0.36
> >>  3 1.00000 1.00000  5585G 3559G 2025G 63.73 1.13
> >>  4 1.00000 1.00000  5585G 2354G 3230G 42.16 0.75
> >> 11 1.00000 1.00000  3723G 1302G 2420G 34.99 0.62
> >> 17 0.95000 0.95000  3723G 3388G  334G 91.01 1.62
> >> 12 1.00000 1.00000  3723G 2922G  800G 78.50 1.40
> >>  5 1.00000 1.00000  5585G 3972G 1613G 71.12 1.27
> >>  6 1.00000 1.00000  5585G 2975G 2609G 53.28 0.95
> >>  7 1.00000 1.00000  5585G 2208G 3376G 39.54 0.70
> >> 13 1.00000 1.00000  3723G 2092G 1631G 56.19 1.00
> >> 18 1.00000 1.00000  3723G 3144G  578G 84.45 1.50
> >>  8 1.00000 1.00000  5585G 2909G 2675G 52.10 0.93
> >>  9 1.00000 1.00000  5585G 3089G 2495G 55.31 0.98
> >> 14 0.95000 0            0     0     0     0    0  (this osd is full at 97%)
> >> 15 1.00000 1.00000  3723G 2629G 1093G 70.63 1.26
> >> 19 1.00000 1.00000  3723G 1781G 1941G 47.86 0.85
> >>         TOTAL      89360G 50217G 39143G 56.20
> >> MIN/MAX VAR: 0/1.62  STDDEV: 16.80
> >>
> > And this is where your problem stems from.
> > How did you deploy this cluster?
> > Normally the weight is the size of the OSD in TB.
> > By setting them all to 1 you're essentially filling up your 4TB drives
> > long before the 6TB ones.
> > I assume OSD 14 is also a 4TB one, right?
> >
> > What you want to do, once everything is "stable" as outlined above, is
> > to very, VERY lightly adjust crush weights.
> > Adjusting things will move data around, sometimes rather randomly and
> > unexpectedly.
> > It can (at least temporarily) put even more objects on your already
> > overloaded OSDs, but if you limit it to a really small amount (one or
> > two PGs at a time, hopefully) this shouldn't be too much of an issue.
> > Of course you have far more data in your PGs than you ought to have,
> > due to your low PG count.
> >
> > What you want to do is to attract PGs to the bigger OSDs while also
> > keeping the host weights/ratios in mind.
> > So in your case I would start with a:
> > ---
> > ceph osd crush reweight osd.0 1.001
> > ---
> > Which should hopefully result in about one PG being moved to osd.0.
> > Observe if that's the case, where it came from, etc.
> > Then repeat this with osd.1 and 2, then 6 and 7, then 4.
> >
> > Track what's happening and keep doing this with the least utilized 6TB
> > OSDs until you have the 4TB OSDs at sensible utilization levels.
> > Again, keep in mind that the host weight (which is the sum of all
> > OSDs on it) should not deviate too much from the other hosts at this
> > point in time. Later on it should of course actually reflect reality.
> >
> > Once the 6TB OSDs have more or less the same relative utilization as
> > the 4TB ones, you can either leave things (crush weights) where they
> > are or, preferably, take the plunge and set things "correctly".
> >
> > I'd do it by first setting nobackfill, then go and set all the crush
> > weights to the respective OSD size, for example:
> > ---
> > ceph osd crush reweight osd.0 5.585
> > ---
> > Then after setting all those weights unset nobackfill and let things
> > rebalance; if the ratios were close before, this should result in
> > relatively little data movement.
> > You probably still want to do this during an off-peak time of course.
> >
> > Then you get to think long and hard about increasing your PG count and
> > change that. Of course you could also do that after your 4TB OSDs are
> > no longer over-utilized.
> >
> > Regards,
> >
> > Christian
> >
>
> The cluster started with half the osds and a lot less data.
> During testing we hit the 'too many pgs per osd' error and found out
> that the number can't be decreased. That's why, when going into
> production, we set the initial number of pgs per pool to smaller
> numbers. We should have increased the number of pgs earlier, but the
> amount of data increased rather quickly and, well... we forgot to
> increase the number of pgs in time.
>
> Anyway, over the weekend we managed to get the cluster into a better
> state - data is more balanced over the osds:
>
> [root@cf04 ~]# ceph osd df
> ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR
>  0 1.00000 1.00000  5585G 2580G 3005G 46.19 0.87
>  1 1.00000 1.00000  5585G 3712G 1872G 66.47 1.25
>  2 1.00000 1.00000  5585G 3489G 2095G 62.49 1.17
> 10 1.00000 1.00000  3723G 2475G 1247G 66.49 1.25
> 16 1.00000 1.00000  3723G 1773G 1949G 47.64 0.89
>  3 1.00000 1.00000  5585G 3651G 1934G 65.37 1.23
>  4 1.00000 1.00000  5585G 3085G 2500G 55.24 1.04
> 11 1.00000 1.00000  3723G 1589G 2133G 42.69 0.80
> 17 1.00000 0.36897  3723G  912G 2811G 24.50 0.46
> 12 1.00000 0.29999  3723G 1575G 2148G 42.31 0.79
>  5 1.00000 0.78925  5585G 2486G 3098G 44.52 0.84
>  6 1.00000 1.00000  5585G 3266G 2319G 58.48 1.10
>  7 1.00000 1.00000  5585G 3157G 2427G 56.54 1.06
> 13 1.00000 1.00000  3723G 2082G 1641G 55.92 1.05
> 18 1.00000 0.46581  3723G 1750G 1972G 47.01 0.88
>  8 1.00000 1.00000  5585G 3079G 2506G 55.13 1.03
>  9 1.00000 1.00000  5585G 2816G 2768G 50.42 0.95
> 14 1.00000 0.29999  3723G 1906G 1816G 51.20 0.96
> 15 1.00000 0.64502  3723G 1436G 2286G 38.58 0.72
> 19 1.00000 1.00000  3723G 2791G  932G 74.97 1.41
>
I very specifically and intentionally wrote "ceph osd crush reweight" in
my reply above.
While your current state of affairs is better, it is not permanent
("ceph osd reweight" settings are lost if an OSD is set out) and what I
outlined should have left you with nearly perfect CRUSH weight ratios.

Oh well, since you're already far down that path, continue until the
respective ratios (the %USE column in the output above) are as close to
each other as possible.

> [root@cf04 ~]# ceph -s
>     cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5
>      health HEALTH_WARN
>             5 pgs backfill
>             3 pgs backfilling
>             7 pgs degraded
>             4 pgs recovery_wait
>             7 pgs stuck degraded
>             14 pgs stuck unclean
>             recovery 9386/97872838 objects degraded (0.010%)
>             recovery 8964110/97872838 objects misplaced (9.159%)
>             nodeep-scrub flag(s) set
>      monmap e1: 3 mons at
> {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0}
>             election epoch 5994, quorum 0,1,2 cf01,cf02,cf03
>      osdmap e6626: 20 osds: 20 up, 20 in; 14 remapped pgs
>             flags nodeep-scrub
>       pgmap v12464669: 304 pgs, 17 pools, 24008 GB data, 45688 kobjects
>             49612 GB used, 43471 GB / 93083 GB avail
>             9386/97872838 objects degraded (0.010%)
>             8964110/97872838 objects misplaced (9.159%)
>                  287 active+clean
>                    5 active+remapped+wait_backfill
>                    4 active+recovery_wait+degraded+remapped
>                    3 active+degraded+remapped+backfilling
>                    3 active+clean+scrubbing
>                    2 active+remapped
>
You might want to disable (normal) scrubbing as well for the duration.
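That works with the same mechanism as the nodeep-scrub flag you already
have set, roughly:
---
ceph osd set noscrub      # pause normal scrubs while the data shuffling runs
ceph osd unset noscrub    # re-enable once you're back to HEALTH_OK
---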
>
> I'd like to set the crush weights to correct values (size in TB) - all
> in one move - but I'm afraid it will result in a lot of data movement.
>
If your ratios are correct at that time, it will be very little; the
heavy lifting is mostly what you're doing now.

> So - assuming all goes well and the cluster will be in HEALTH_OK state
> within a day or two - what would you recommend doing first - increasing
> the pgs on the pools with the most data (and is it safe to go from a
> low number like 64 to 1024 in one step, or should we do this step by
> step - by a factor of two)?
>
Recent versions of Ceph won't allow you to do large increases anyway
(doubling at most, I think), so obviously the latter (a rough command
sketch is below, after my signature).
And yes, this will cause MASSIVE data movement, but it will also reduce
the amount of data moving around (smaller PGs) in the last step.
I would do this first.

> Or should we first adjust crush weights and then increase pgs?
> When adjusting crush weights should we reset the "reweight" to 1.0 or
> should it be set to the number of TBs per drive as well?
>
"Subcommand reweight reweights osd to 0.0 < <weight> < 1.0."
As I said, you shouldn't have used that; it's a temporary crutch at best.

So as I wrote originally: set nobackfill, adjust all crush weights, set
all osd reweights to 1, unset nobackfill and enjoy the show (also
sketched below). The closer to equal your ratios were before that, the
smaller the show will be, see above.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
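P.S.: To make the two steps above a bit more concrete, rough sketches
follow; treat them as examples rather than copy-paste material and check
the syntax against the docs for your Ceph version.

For the step-by-step pg increase, per pool (the pool name and the target
numbers are placeholders; let the cluster settle between doublings):
---
ceph osd pool set <poolname> pg_num 128
ceph osd pool set <poolname> pgp_num 128
# then 256, 512, ... until the pool has a sensible PG count
---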
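And for the final cutover once the %USE ratios line up (OSD numbers and
sizes taken from your "ceph osd df" output above; repeat for all 20 OSDs
and for every reweight you lowered):
---
ceph osd set nobackfill
ceph osd crush reweight osd.0 5.585     # 6TB OSDs -> 5.585
ceph osd crush reweight osd.10 3.723    # 4TB OSDs -> 3.723
# ...and so on for the remaining OSDs...
ceph osd reweight 17 1.0                # reset the temporary reweights
ceph osd reweight 12 1.0
# ...and the others currently below 1.0...
ceph osd unset nobackfill
---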