Hello, On Thu, 24 Mar 2016 10:11:09 +0100 Jacek Jarosiewicz wrote: > Hi! > > I have a problem with the osds getting full on our cluster. > > I've read all the topics on this list on how to deal with that, but I > have a few questions. > "All" is probably a misnomer here, your situation isn't all that uncommon. Monitoring free disk space with things like nagios and/or graphing things with things like graphite will avoid getting into this state of affairs to begin with. > First of all - the cluster itself is nowhere near being filled (about > 55% data is used), but the osds don't get filled equally. > See below. > I've tried adding osds, but it didn't help - still some osds are being > filled more than the others. I tried adjusting ratios, but it's not a > long-term solution. I've tried adjusting weights, but I'm not sure if > I'm doing it right.. > You're not. > At this point I had to stop the full osd (this is a production cluster) > so that the radosgw will work. > While understandable this of course isn't a good state nor the way forward. By doing so you triggered further data movement which will not improve things until your balancing is correct. > Am I correctly assuming, that since the cluster is in WARNING state (not > ERR) with that one osd down - that means I can safely delete some pgs > from that osd? They have copies on other osds, otherwise cluster would > be in ERR state? I can't start the osd because that would stop the > radosgw from working. > In theory, yes. In practice as well, but I would try other means to resolve this first. > Can you suggest how to reweight the osds so that the data will be > distributed evenly (more or less..). > See below. > Also - the cluster got stuck with some backfill_toofull pgs - is there a > way to deal with that? I've adjusted the ratio, but the pgs still are in > backfill_toofull state.. > See below. > Here's some info about the current cluster state - the norecover flag is > set, because recovery process caused the requests to be blocked and > radosgw giving too many errors, the flag is unset during the night. BTW > - is there a way to slow down the rebalancing so that the cluster will > still be responsive while repairing/moving pgs? > You would not want to do this particular rebalancing (caused by shutting down the osd) to happen at all. I'm not certain, but having the cluster in "nout" and the full (that is actually full and not nearfull, right) OSD in and down would be preferable if that doesn't cause radosgw to seize up. The things that affect this the most have been discussed here before, this is an EXAMPLE (understand their effect before applying them) with very throttled down values: --- osd_max_backfills = 1 osd_backfill_scan_min = 4 osd_backfill_scan_max = 32 osd_recovery_max_active = 1 osd recovery threads = 1 osd recovery op priority = 1 --- > [root@cf01 ceph]# ceph -v > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) > > [root@cf01 ceph]# ceph -s > cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5 > health HEALTH_WARN > 10 pgs backfill_toofull > 48 pgs backfilling > 31 pgs degraded > 1 pgs recovering > 31 pgs stuck degraded > 59 pgs stuck unclean > 30 pgs stuck undersized > 30 pgs undersized > recovery 6408175/131078852 objects degraded (4.889%) > recovery 69703039/131078852 objects misplaced (53.176%) > 1 near full osd(s) > norecover flag(s) set > monmap e1: 3 mons at > {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0} > election epoch 5826, quorum 0,1,2 cf01,cf02,cf03 > osdmap e5906: 20 osds: 19 up, 19 in; 58 remapped pgs > flags norecover > pgmap v12075461: 304 pgs, 17 pools, 23771 GB data, 45051 kobjects As others stated, way too little PGs, definitely not improving your balance, but not the root cause of your problems. > 50218 GB used, 39142 GB / 89360 GB avail > 6408175/131078852 objects degraded (4.889%) > 69703039/131078852 objects misplaced (53.176%) > 241 active+clean > 24 active+remapped+backfilling > 24 active+undersized+degraded+remapped+backfilling > 6 > active+undersized+degraded+remapped+backfill_toofull 4 > active+clean+scrubbing+deep 4 active+remapped+backfill_toofull > 1 active+recovering+degraded > > [root@cf01 ceph]# ceph --admin-daemon /run/ceph/ceph-mon.cf01.asok > config show | grep full > "mon_cache_target_full_warn_ratio": "0.66", > "mon_osd_full_ratio": "0.95", I'd crank that up to .98 and bring the osd back up and in first. The wait for everything to become stable again, as in all PGs being clean. Also to speed things up, disable scrubs for the time being. More below. > "mon_osd_nearfull_ratio": "0.85", > "paxos_stash_full_interval": "25", > "osd_backfill_full_ratio": "0.9", > "osd_pool_default_cache_target_full_ratio": "0.8", > "osd_debug_skip_full_check_in_backfill_reservation": "false", > "osd_failsafe_full_ratio": "0.97", > "osd_failsafe_nearfull_ratio": "0.9", > > [root@cf01 ceph]# ceph df > GLOBAL: > SIZE AVAIL RAW USED %RAW USED > 89360G 39137G 50223G 56.20 > POOLS: > NAME ID USED %USED MAX AVAIL > OBJECTS > vms 0 9907G 11.09 3505G > 2541252 > .rgw.root 1 848 0 3505G > 3 > .rgw.control 2 0 0 3505G > 8 > .rgw.gc 3 0 0 3505G > 32 > .rgw.buckets_cache 4 0 0 3505G > 0 > .rgw.buckets.index 5 0 0 3505G > 67102 > .rgw.buckets.extra 6 0 0 3505G > 6 > .log 7 121G 0.14 3505G > 91018 > .intent-log 8 0 0 3505G > 0 > .usage 9 0 0 3505G > 18 > .users 10 597 0 3505G > 36 > .users.email 11 0 0 3505G > 0 > .users.swift 12 0 0 3505G > 0 > .users.uid 13 11694 0 3505G > 57 > .rgw.buckets 14 13699G 15.33 3505G > 43421376 > .rgw 15 9256 0 3505G > 50 > one 17 43840M 0.05 2337G > 11153 > > > ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR > 0 1.00000 1.00000 5585G 2653G 2931G 47.51 0.85 > 1 1.00000 1.00000 5585G 2960G 2624G 53.02 0.94 > 2 1.00000 1.00000 5585G 3193G 2391G 57.18 1.02 > 10 1.00000 1.00000 3723G 2315G 1408G 62.18 1.11 > 16 1.00000 1.00000 3723G 763G 2959G 20.50 0.36 > 3 1.00000 1.00000 5585G 3559G 2025G 63.73 1.13 > 4 1.00000 1.00000 5585G 2354G 3230G 42.16 0.75 > 11 1.00000 1.00000 3723G 1302G 2420G 34.99 0.62 > 17 0.95000 0.95000 3723G 3388G 334G 91.01 1.62 > 12 1.00000 1.00000 3723G 2922G 800G 78.50 1.40 > 5 1.00000 1.00000 5585G 3972G 1613G 71.12 1.27 > 6 1.00000 1.00000 5585G 2975G 2609G 53.28 0.95 > 7 1.00000 1.00000 5585G 2208G 3376G 39.54 0.70 > 13 1.00000 1.00000 3723G 2092G 1631G 56.19 1.00 > 18 1.00000 1.00000 3723G 3144G 578G 84.45 1.50 > 8 1.00000 1.00000 5585G 2909G 2675G 52.10 0.93 > 9 1.00000 1.00000 5585G 3089G 2495G 55.31 0.98 > 14 0.95000 0 0 0 0 0 0 (this osd is full at > 97%) > 15 1.00000 1.00000 3723G 2629G 1093G 70.63 1.26 > 19 1.00000 1.00000 3723G 1781G 1941G 47.86 0.85 > TOTAL 89360G 50217G 39143G 56.20 > MIN/MAX VAR: 0/1.62 STDDEV: 16.80 > And this is were your problem stems from. How did you deploy this cluster? Normally the weight is the size of the OSD TB. By setting it all to 1 essentially, you're filling up your 4TB drives long before the 6TB ones. I assume OSD 14 is also a 4TB one, right? What you want to do is once everything is "stable" as outlined above is to very, VERY lightly adjust crush weights. Adjusting things will move things around, sometimes rather randomly and unexpectedly. It can (at least temporarily) put even more objects on your already overloaded OSDs, so limiting it to a really small amount (one or two PGs at a time hopefully) this shouldn't be too much of an issue. Of course you have far more data in your PGs than you ought to have, due to your low PG count. What you want to do is to attract PGs to the bigger OSDs and also keeping the host weight/ratios in mind. So in your case I would start with a: --- ceph osd crush reweight osd.0 1.001 --- Which should hopefully result in about one PG being moved to osd.0. Observe if that's the case, where it came from, etc. Then repeat this with osd.1 and 2, then 6 and 7, then 4. Track what's happening and keep doing this with the least utilized 6TB OSDs until you have the 4TB OSDs at sensible utilization levels. Again, keep in mind that the host weight (which is the sum of all OSDs on it) should not deviate too much from the other hosts at this point in time. Later on it should of course actually reflect reality. Once you have things where the 6TB OSDs have more or less the same relative utilization as the 4TB ones you could either leave things (crush weights) where they are or preferably take the plunge and set things "correctly". I'd do it by first setting nobackfill, then go and set the all the crush weights to the respective OSD size, for example: --- ceph osd crush reweight osd.0 5.585 --- Then after setting all those weights unset nobackfill and let things rebalance, if the ratios where close before this should result in relatively little data movement. You probably still want to do this during an off peak time of course. Then you get to think long and hard about increasing your PG count and change that. Of course you could do that also after your 4TB OSDs are no longer over-utilized. Regards, Christian > [root@cf01 ceph]# ceph osd tree > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 19.89999 root default > -2 5.00000 host cf01 > 0 1.00000 osd.0 up 1.00000 1.00000 > 1 1.00000 osd.1 up 1.00000 1.00000 > 2 1.00000 osd.2 up 1.00000 1.00000 > 10 1.00000 osd.10 up 1.00000 1.00000 > 16 1.00000 osd.16 up 1.00000 1.00000 > -3 4.95000 host cf02 > 3 1.00000 osd.3 up 1.00000 1.00000 > 4 1.00000 osd.4 up 1.00000 1.00000 > 11 1.00000 osd.11 up 1.00000 1.00000 > 17 0.95000 osd.17 up 0.95000 1.00000 > 12 1.00000 osd.12 up 1.00000 1.00000 > -4 5.00000 host cf03 > 5 1.00000 osd.5 up 1.00000 1.00000 > 6 1.00000 osd.6 up 1.00000 1.00000 > 7 1.00000 osd.7 up 1.00000 1.00000 > 13 1.00000 osd.13 up 1.00000 1.00000 > 18 1.00000 osd.18 up 1.00000 1.00000 > -5 4.95000 host cf04 > 8 1.00000 osd.8 up 1.00000 1.00000 > 9 1.00000 osd.9 up 1.00000 1.00000 > 14 0.95000 osd.14 down 0 1.00000 > 15 1.00000 osd.15 up 1.00000 1.00000 > 19 1.00000 osd.19 up 1.00000 1.00000 > > # begin crush map > tunable choose_local_tries 0 > tunable choose_local_fallback_tries 0 > tunable choose_total_tries 50 > tunable chooseleaf_descend_once 1 > tunable straw_calc_version 1 > > # devices > device 0 osd.0 > device 1 osd.1 > device 2 osd.2 > device 3 osd.3 > device 4 osd.4 > device 5 osd.5 > device 6 osd.6 > device 7 osd.7 > device 8 osd.8 > device 9 osd.9 > device 10 osd.10 > device 11 osd.11 > device 12 osd.12 > device 13 osd.13 > device 14 osd.14 > device 15 osd.15 > device 16 osd.16 > device 17 osd.17 > device 18 osd.18 > device 19 osd.19 > > # types > type 0 osd > type 1 host > type 2 chassis > type 3 rack > type 4 row > type 5 pdu > type 6 pod > type 7 room > type 8 datacenter > type 9 region > type 10 root > > # buckets > host cf01 { > id -2 # do not change unnecessarily > # weight 5.000 > alg straw > hash 0 # rjenkins1 > item osd.0 weight 1.000 > item osd.1 weight 1.000 > item osd.2 weight 1.000 > item osd.10 weight 1.000 > item osd.16 weight 1.000 > } > host cf02 { > id -3 # do not change unnecessarily > # weight 4.950 > alg straw > hash 0 # rjenkins1 > item osd.3 weight 1.000 > item osd.4 weight 1.000 > item osd.11 weight 1.000 > item osd.17 weight 0.950 > item osd.12 weight 1.000 > } > host cf03 { > id -4 # do not change unnecessarily > # weight 5.000 > alg straw > hash 0 # rjenkins1 > item osd.5 weight 1.000 > item osd.6 weight 1.000 > item osd.7 weight 1.000 > item osd.13 weight 1.000 > item osd.18 weight 1.000 > } > host cf04 { > id -5 # do not change unnecessarily > # weight 4.950 > alg straw > hash 0 # rjenkins1 > item osd.8 weight 1.000 > item osd.9 weight 1.000 > item osd.14 weight 0.950 > item osd.15 weight 1.000 > item osd.19 weight 1.000 > } > root default { > id -1 # do not change unnecessarily > # weight 19.900 > alg straw > hash 0 # rjenkins1 > item cf01 weight 5.000 > item cf02 weight 4.950 > item cf03 weight 5.000 > item cf04 weight 4.950 > } > > # rules > rule replicated_ruleset { > ruleset 0 > type replicated > min_size 1 > max_size 10 > step take default > step chooseleaf firstn 0 type host > step emit > } > > # end crush map > -- Christian Balzer Network/Systems Engineer chibi@xxxxxxx Global OnLine Japan/Rakuten Communications http://www.gol.com/ _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com