On Wed, 7 Jan 2015 00:54:13 +0900 Christian Balzer wrote:

> On Tue, 6 Jan 2015 19:28:44 +0400 ivan babrou wrote:
>
> > Restarting OSD fixed PGs that were stuck: http://i.imgur.com/qd5vuzV.png
>
> Good to hear that.
> Funny (not really) how often restarting OSDs fixes stuff like that.
>
> > Still OSD disk usage is very different, 150..250gb. Shall I double PGs again?
>
> Not really, your settings are now if anything on the high side.
>
> Looking at your graph and data, the current variance is clearly an improvement over the previous state.
> Though far from ideal of course.
>
> I had a Firefly cluster that had non-optimal CRUSH tunables until 20 minutes ago.
> From the looks of it so far it will improve data placement, however it is a very involved process (lots of data movement) and on top of that your clients all need to support this.

So the re-balancing finished after moving 35% of my objects in about 1.5 hours. Clearly this is something that should be done during off-peak times, potentially with the backfill settings tuned down.

Before getting to the results, a question for the devs: Why can't I see tunables_3 (or chooseleaf_vary_r) in either the running config or the "ceph osd crush show-tunables" output?

---
{ "choose_local_tries": 0,
  "choose_local_fallback_tries": 0,
  "choose_total_tries": 50,
  "chooseleaf_descend_once": 1,
  "profile": "bobtail",
  "optimal_tunables": 0,
  "legacy_tunables": 0,
  "require_feature_tunables": 1,
  "require_feature_tunables2": 1}
---

Note that after setting things to optimal, unsurprisingly the only things that change are the profile (to firefly) and optimal_tunables (to 1).

Now for the results: it reduced my variance from 30% to 25%. In fact nearly all OSDs are now within 15% of each other, but one OSD is still 10% larger than the average.

It might turn out better for Ivan, but no guarantees of course. Given that even 5% should help and you've just reduced the data size to accommodate such a rebalancing, I'd go for it, provided your clients can handle this change as pointed out below.
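For the record, the switch itself is a one-liner, and the resulting backfill traffic can be throttled for the duration of the data movement. Roughly along these lines; the throttle values below are only an example, not a recommendation for any particular hardware:

---
# move the cluster to the optimal (firefly) tunables profile
ceph osd crush tunables optimal
# optionally dial backfill/recovery down while the data movement runs
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
# and keep an eye on progress
ceph -w
---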
Christian

> So let me get back to you tomorrow on whether that actually improved things massively, and you should read up at:
>
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
> In particular:
> ---
> WHICH CLIENT VERSIONS SUPPORT CRUSH_TUNABLES3
>
> v0.78 (firefly) or later
> Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
> ---
>
> Regards,
>
> Christian
>
> > On 6 January 2015 at 17:12, ivan babrou <ibobrik@xxxxxxxxx> wrote:
> >
> > > I deleted some old backups and GC is returning some disk space back. But cluster state is still bad:
> > >
> > > 2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23 active+remapped+wait_backfill, 1 active+remapped+wait_backfill+backfill_toofull, 2 active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784 GB used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623 objects degraded (0.529%)
> > >
> > > Here's what disk utilization across OSDs looks like: http://i.imgur.com/RWk9rvW.png
> > >
> > > Still one OSD is super-huge. I don't understand why one PG is toofull if the biggest OSD moved from 348gb to 294gb.
> > >
> > > root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
> > > dumped all in format plain
> > > 10.f26 1018 0 1811 0 2321324247 3261 3261 active+remapped+wait_backfill+backfill_toofull 2015-01-05 15:06:49.504731 22897'359132 22897:48571 [91,1] 91 [8,40] 8 19248'358872 2015-01-05 11:58:03.062029 18326'358786 2014-12-31 23:43:02.285043
> > >
> > > On 6 January 2015 at 03:40, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > >> On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
> > >>
> > >> > Rebalancing is almost finished, but things got even worse: http://i.imgur.com/0HOPZil.png
> > >>
> > >> Looking at that graph only one OSD really kept growing and growing; everything else seems to be a lot denser, less varied than before, as one would have expected.
> > >>
> > >> Since I don't think you mentioned it before, what version of Ceph are you using and how are your CRUSH tunables set?
> > >
> > > I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at all.
> > >
> > >> > Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull state:
> > >> >
> > >> > 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23 active+remapped+wait_backfill, 1 active+remapped+wait_backfill+backfill_toofull, 2 active+remapped+backfilling, 5805 active+clean, 1 active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360 GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
> > >> >
> > >> > So at 55.8% disk space utilization ceph is full. That doesn't look very good.
> > >>
> > >> Indeed it doesn't.
> > >>
> > >> At this point you might want to manually lower the weight of that OSD (probably have to change the osd_backfill_full_ratio first to let it settle).
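For concreteness, that would be something along these lines; the OSD id and the values here are purely illustrative:

---
# give backfill a bit more headroom on the nearly full OSDs (the default ratio is 0.85)
ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.90'
# then lower the override weight of the overly full OSD a notch
ceph osd reweight 42 0.9
---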
> > > I'm sure that's what ceph should do, not me.
> > >
> > >> Thanks to Robert for bringing up that blueprint for Hammer, let's hope it makes it in and gets backported.
> > >>
> > >> I sure hope somebody from the Ceph team will pipe up, but here's what I think is happening: you're using radosgw and I suppose many files are so similarly named that they wind up clumping on the same PGs (OSDs).
> > >
> > > Nope, you are wrong here. PGs have roughly the same size, I mentioned that in my first email. Now the biggest osd has 95 PGs and the smallest one has 59 (I only counted PGs from the biggest pool).
> > >
> > >> Now what I would _think_ could help with that is striping.
> > >>
> > >> However radosgw doesn't support the full striping options as RBD does.
> > >>
> > >> The only thing you can modify is stripe (object) size, which defaults to 4MB. And I bet most of your RGW files are less than that in size, meaning they wind up on just one PG.
> > >
> > > Wrong again, I use that cluster for elasticsearch backups and docker images. That stuff is usually much bigger than 4mb.
> > >
> > > Weird thing: I calculated osd sizes from "ceph pg dump" and they look different from what really happens. Biggest OSD is 213gb and the smallest is 131gb. GC isn't finished yet, but that seems very different from what currently happens.
> > >
> > > # ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' | sed 's/[][,]/ /g' > pgs.txt
> > > # cat pgs.txt | awk '{ sizes[$3] += $2; sizes[$4] += $2; } END { for (o in sizes) { printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024; } }' | sort -n
> > >
> > > 0 198.18 gb
> > > 1 188.74 gb
> > > 2 165.94 gb
> > > 3 143.28 gb
> > > 4 193.37 gb
> > > 5 185.87 gb
> > > 6 146.46 gb
> > > 7 170.67 gb
> > > 8 213.93 gb
> > > 9 200.22 gb
> > > 10 144.05 gb
> > > 11 164.44 gb
> > > 12 158.27 gb
> > > 13 204.96 gb
> > > 14 190.04 gb
> > > 15 158.48 gb
> > > 16 172.86 gb
> > > 17 157.05 gb
> > > 18 179.82 gb
> > > 19 175.86 gb
> > > 20 192.63 gb
> > > 21 179.82 gb
> > > 22 181.30 gb
> > > 23 172.97 gb
> > > 24 141.21 gb
> > > 25 165.63 gb
> > > 26 139.87 gb
> > > 27 184.18 gb
> > > 28 160.75 gb
> > > 29 185.88 gb
> > > 30 186.13 gb
> > > 31 163.38 gb
> > > 32 182.92 gb
> > > 33 134.82 gb
> > > 34 186.56 gb
> > > 35 166.91 gb
> > > 36 163.49 gb
> > > 37 205.59 gb
> > > 38 199.26 gb
> > > 39 151.43 gb
> > > 40 173.23 gb
> > > 41 200.54 gb
> > > 42 198.07 gb
> > > 43 150.48 gb
> > > 44 165.54 gb
> > > 45 193.87 gb
> > > 46 177.05 gb
> > > 47 167.97 gb
> > > 48 186.68 gb
> > > 49 177.68 gb
> > > 50 204.94 gb
> > > 51 184.52 gb
> > > 52 160.11 gb
> > > 53 163.33 gb
> > > 54 137.28 gb
> > > 55 168.97 gb
> > > 56 193.08 gb
> > > 57 176.87 gb
> > > 58 166.36 gb
> > > 59 171.98 gb
> > > 60 175.50 gb
> > > 61 199.39 gb
> > > 62 175.31 gb
> > > 63 164.54 gb
> > > 64 171.26 gb
> > > 65 154.86 gb
> > > 66 166.39 gb
> > > 67 145.15 gb
> > > 68 162.55 gb
> > > 69 181.13 gb
> > > 70 181.18 gb
> > > 71 197.67 gb
> > > 72 164.79 gb
> > > 73 143.85 gb
> > > 74 169.17 gb
> > > 75 183.67 gb
> > > 76 143.16 gb
> > > 77 171.91 gb
> > > 78 167.75 gb
> > > 79 158.36 gb
> > > 80 198.83 gb
> > > 81 158.26 gb
> > > 82 182.52 gb
> > > 83 204.65 gb
> > > 84 179.78 gb
> > > 85 170.02 gb
> > > 86 185.70 gb
> > > 87 138.91 gb
> > > 88 190.66 gb
> > > 89 209.43 gb
> > > 90 193.54 gb
> > > 91 185.00 gb
> > > 92 170.31 gb
> > > 93 140.11 gb
> > > 94 161.69 gb
> > > 95 194.53 gb
> > > 96 184.35 gb
> > > 97 158.74 gb
> > > 98 184.39 gb
> > > 99 174.83 gb
> > > 100 183.30 gb
> > > 101 179.82 gb
> > > 102 160.84 gb
> > > 103 163.29 gb
> > > 104 131.92 gb
> > > 105 158.09 gb
> > >
> > >> Again, would love to hear something from the devs on this one.
> > >>
> > >> Christian
> > >>
> > >> > On 5 January 2015 at 15:39, ivan babrou <ibobrik@xxxxxxxxx> wrote:
> > >> >
> > >> > > On 5 January 2015 at 14:20, Christian Balzer <chibi@xxxxxxx> wrote:
> > >> > >
> > >> > >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
> > >> > >>
> > >> > >> > Hi!
> > >> > >> >
> > >> > >> > I have a cluster with 106 osds and disk usage is varying from 166gb to 316gb. Disk usage is highly correlated to number of pgs per osd (no surprise here). Is there a reason for ceph to allocate more pgs on some nodes?
> > >> > >>
> > >> > >> In essence what Wido said, you're a bit low on PGs.
> > >> > >>
> > >> > >> Also given your current utilization, pool 14 is totally oversized with 1024 PGs. You might want to re-create it with a smaller size and double pool 0 to 512 PGs and 10 to 4096.
> > >> > >> I assume you did raise the PGPs as well when changing the PGs, right?
> > >> > >
> > >> > > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes, it might get large eventually.
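In case it helps anyone else reading along: the bump is done per pool, pg_num first and pgp_num second, and only the pgp_num change actually starts the data movement across OSDs. Pool name and count below are just an example:

---
ceph osd pool set rbd pg_num 4096
ceph osd pool set rbd pgp_num 4096
---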
> > >> > > I followed your advice in doubling pools 0 and 10. It is rebalancing at 30% degraded now, but so far big osds become bigger and small become smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend will change before rebalancing is complete.
> > >> > >
> > >> > >> And yeah, CEPH isn't particularly good at balancing stuff by itself, but with sufficient PGs you ought to get the variance below/around 30%.
> > >> > >
> > >> > > Is this going to change in future releases?
> > >> > >
> > >> > >> Christian
> > >> > >>
> > >> > >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools with very little data have only 8 pgs. PG size in the biggest pool is ~6gb (5.1..6.3 actually).
> > >> > >> >
> > >> > >> > Lack of balanced disk usage prevents me from using all the disk space. When the biggest osd is full, the cluster does not accept writes anymore.
> > >> > >> >
> > >> > >> > Here's a gist with info about my cluster: https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
> > >> > >>
> > >> > >> --
> > >> > >> Christian Balzer    Network/Systems Engineer
> > >> > >> chibi@xxxxxxx       Global OnLine Japan/Fusion Communications
> > >> > >> http://www.gol.com/
> > >> > >
> > >> > > --
> > >> > > Regards, Ian Babrou
> > >> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> > >>
> > >> --
> > >> Christian Balzer    Network/Systems Engineer
> > >> chibi@xxxxxxx       Global OnLine Japan/Fusion Communications
> > >> http://www.gol.com/
> > >
> > > --
> > > Regards, Ian Babrou
> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

--
Christian Balzer    Network/Systems Engineer
chibi@xxxxxxx       Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com