Re: Different disk usage on different OSDs

On Wed, 7 Jan 2015 00:54:13 +0900 Christian Balzer wrote:

> On Tue, 6 Jan 2015 19:28:44 +0400 ivan babrou wrote:
> 
> > Restarting OSD fixed PGs that were stuck:
> > http://i.imgur.com/qd5vuzV.png
> > 
> Good to hear that. 
> 
> Funny (not really) how often restarting OSDs fixes stuff like that.
> 
> > Still OSD disk usage is very different, 150..250gb. Shall I double PGs
> > again?
> > 
> Not really; your settings are now, if anything, on the high side.
> 
> Looking at your graph and data, the current variance is clearly an
> improvement over the previous state, though far from ideal of course.
> 
> I had a Firefly cluster that had non-optimal CRUSH tunables until 20
> minutes ago.
> From the looks of it so far this will improve data placement; however, it
> is a very involved process (lots of data movement) and on top of that all
> your clients need to support it.
> 

So the re-balancing finished after moving 35% of my objects in about 1.5
hours. 
Clearly this is something that should be done during off-peak times,
potentially with the backfill settings tuned down.
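
For reference, this is the sort of thing I mean by tuning the backfill
settings down (a rough sketch with example values; revert them once the
data movement is done):
---
# limit concurrent backfills and recovery ops per OSD while rebalancing
ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
---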

Before getting to the results, a question for the devs:
Why can't I see tunables_3 (or chooseleaf_vary_r) in either the running
config or the "ceph osd crush show-tunables" output?
---
{ "choose_local_tries": 0,
  "choose_local_fallback_tries": 0,
  "choose_total_tries": 50,
  "chooseleaf_descend_once": 1,
  "profile": "bobtail",
  "optimal_tunables": 0,
  "legacy_tunables": 0,
  "require_feature_tunables": 1,
  "require_feature_tunables2": 1}
---
Note that after setting things to optimal, unsurprisingly the only things
that change are the profile (to firefly) and optimal_tunables (to 1).
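
For what it's worth, chooseleaf_vary_r does show up when decompiling the
CRUSH map itself (a rough sketch, file names are arbitrary):
---
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
grep ^tunable /tmp/crushmap.txt
# with the firefly profile this should list something like:
#   tunable chooseleaf_vary_r 1
---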

Now for the results, it reduced my variance from 30% to 25%. 
Actually nearly all OSDs are now within 15% of each other, but one OSD is
still 10% larger than the average.

It might turn out better for Ivan, but no guarantees of course. 
Given that even 5% should help and you've just reduced the data size to
accommodate such a rebalancing, I'd go for it, provided your clients can
handle this change as pointed out below.
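
For reference, switching profiles is a one-liner, and going back is just as
simple should old clients turn out to be a problem (a sketch, not an
endorsement of doing it during peak hours):
---
# switch to the optimal (firefly) tunables; this triggers the data movement
ceph osd crush tunables optimal
# revert to the previous profile if old clients can no longer connect
ceph osd crush tunables bobtail
---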

Christian

> So let me get back to you tomorrow on whether that actually improved
> things massively; meanwhile you should read up at:
> 
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> 
> In particular:
> ---
> WHICH CLIENT VERSIONS SUPPORT CRUSH_TUNABLES3
> 
> v0.78 (firefly) or later
> Linux kernel version v3.15 or later (for the file system and RBD kernel
> clients)
> ---
> 
> Regards,
> 
> Christian
> 
> > On 6 January 2015 at 17:12, ivan babrou <ibobrik@xxxxxxxxx> wrote:
> > 
> > > I deleted some old backups and GC is returning some disk space.
> > > But the cluster state is still bad:
> > >
> > > 2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23
> > > active+remapped+wait_backfill, 1
> > > active+remapped+wait_backfill+backfill_toofull, 2
> > > active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784
> > > GB used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623
> > > objects degraded (0.529%)
> > >
> > > Here's what disk utilization across OSDs looks like:
> > > http://i.imgur.com/RWk9rvW.png
> > >
> > > Still one OSD is super-huge. I don't understand why one PG is toofull
> > > if the biggest OSD moved from 348gb to 294gb.
> > >
> > > root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
> > > dumped all in format plain
> > > 10.f26 1018 0 1811 0 2321324247 3261 3261
> > > active+remapped+wait_backfill+backfill_toofull 2015-01-05
> > > 15:06:49.504731 22897'359132 22897:48571 [91,1] 91 [8,40] 8
> > > 19248'358872 2015-01-05 11:58:03.062029 18326'358786 2014-12-31
> > > 23:43:02.285043
> > >
> > >
> > > On 6 January 2015 at 03:40, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > >> On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
> > >>
> > >> > Rebalancing is almost finished, but things got even worse:
> > >> > http://i.imgur.com/0HOPZil.png
> > >> >
> > >> Looking at that graph, only one OSD really kept growing and growing;
> > >> everything else seems to be a lot denser (less varied) than before,
> > >> as one would have expected.
> > >>
> > >> Since I don't think you mentioned it before, what version of Ceph
> > >> are you using and how are your CRUSH tunables set?
> > >>
> > >
> > > I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at
> > > all.
> > >
> > >> > Moreover, one pg is in
> > >> > active+remapped+wait_backfill+backfill_toofull state:
> > >> >
> > >> > 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs:
> > >> > 23 active+remapped+wait_backfill, 1
> > >> > active+remapped+wait_backfill+backfill_toofull, 2
> > >> > active+remapped+backfilling, 5805 active+clean, 1
> > >> > active+remapped+backfill_toofull; 11210 GB data, 26174 GB used,
> > >> > 18360 GB / 46906 GB avail; 65246/10590590 objects degraded
> > >> > (0.616%)
> > >> >
> > >> > So at 55.8% disk space utilization ceph is full. That doesn't look
> > >> > very good.
> > >> >
> > >> Indeed it doesn't.
> > >>
> > >> At this point you might want to manually lower the weight of that
> > >> OSD (probably have to change the osd_backfill_full_ratio first to
> > >> let it settle).
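> > >>
> > >> Roughly along these lines (example values only; the injected setting
> > >> does not survive an OSD restart):
> > >> ---
> > >> # temporarily raise the backfill full threshold so backfill can proceed
> > >> ceph tell osd.\* injectargs '--osd-backfill-full-ratio 0.90'
> > >> # then nudge down the reweight value of the overly full OSD
> > >> ceph osd reweight <osd-id> 0.9
> > >> ---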
> > >>
> > >
> > > I'm sure that's what ceph should do, not me.
> > >
> > >
> > >> Thanks to Robert for bringing up that blueprint for Hammer, let's
> > >> hope it makes it in and gets backported.
> > >>
> > >> I sure hope somebody from the Ceph team will pipe up, but here's
> > >> what I think is happening:
> > >> You're using radosgw and I suppose many files are so similarly named
> > >> that they wind up clumping on the same PGs (OSDs).
> > >>
> > >
> > > Nope, you are wrong here. PGs have roughly the same size, I mentioned
> > > that in my first email. Now the biggest osd has 95 PGs and the
> > > smallest one has 59 (I only counted PGs from the biggest pool).
> > >
> > >
> > >> Now what I would _think_ could help with that is striping.
> > >>
> > >> However radosgw doesn't support the full striping options as RBD
> > >> does.
> > >>
> > >> The only thing you can modify is the stripe (object) size, which
> > >> defaults to 4MB. And I bet most of your RGW files are less than
> > >> that in size, meaning they wind up on just one PG.
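> > >>
> > >> If you did want to play with that, it's a plain config option (I'm
> > >> going from memory on the exact name and section, so double-check; it
> > >> only affects newly written objects):
> > >> ---
> > >> [client.radosgw.gateway]
> > >> # 1MB instead of the 4MB default
> > >> rgw obj stripe size = 1048576
> > >> ---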
> > >>
> > >
> > > Wrong again, I use that cluster for elasticsearch backups and docker
> > > images. That stuff is usually much bigger than 4mb.
> > >
> > > Weird thing: I calculated OSD sizes from "ceph pg dump" and they look
> > > different from actual disk usage. The biggest OSD is 213gb and the
> > > smallest is 131gb. GC isn't finished yet, but that still seems very
> > > different from what the disks currently show.
> > >
> > > # ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' |
> > > sed 's/[][,]/ /g' > pgs.txt
> > > # cat pgs.txt | awk '{ sizes[$3] += $2; sizes[$4] += $2; } END { for
> > > (o in sizes) { printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 /
> > > 1024; } }' | sort -n
> > >
> > > 0 198.18 gb
> > > 1 188.74 gb
> > > 2 165.94 gb
> > > 3 143.28 gb
> > > 4 193.37 gb
> > > 5 185.87 gb
> > > 6 146.46 gb
> > > 7 170.67 gb
> > > 8 213.93 gb
> > > 9 200.22 gb
> > > 10 144.05 gb
> > > 11 164.44 gb
> > > 12 158.27 gb
> > > 13 204.96 gb
> > > 14 190.04 gb
> > > 15 158.48 gb
> > > 16 172.86 gb
> > > 17 157.05 gb
> > > 18 179.82 gb
> > > 19 175.86 gb
> > > 20 192.63 gb
> > > 21 179.82 gb
> > > 22 181.30 gb
> > > 23 172.97 gb
> > > 24 141.21 gb
> > > 25 165.63 gb
> > > 26 139.87 gb
> > > 27 184.18 gb
> > > 28 160.75 gb
> > > 29 185.88 gb
> > > 30 186.13 gb
> > > 31 163.38 gb
> > > 32 182.92 gb
> > > 33 134.82 gb
> > > 34 186.56 gb
> > > 35 166.91 gb
> > > 36 163.49 gb
> > > 37 205.59 gb
> > > 38 199.26 gb
> > > 39 151.43 gb
> > > 40 173.23 gb
> > > 41 200.54 gb
> > > 42 198.07 gb
> > > 43 150.48 gb
> > > 44 165.54 gb
> > > 45 193.87 gb
> > > 46 177.05 gb
> > > 47 167.97 gb
> > > 48 186.68 gb
> > > 49 177.68 gb
> > > 50 204.94 gb
> > > 51 184.52 gb
> > > 52 160.11 gb
> > > 53 163.33 gb
> > > 54 137.28 gb
> > > 55 168.97 gb
> > > 56 193.08 gb
> > > 57 176.87 gb
> > > 58 166.36 gb
> > > 59 171.98 gb
> > > 60 175.50 gb
> > > 61 199.39 gb
> > > 62 175.31 gb
> > > 63 164.54 gb
> > > 64 171.26 gb
> > > 65 154.86 gb
> > > 66 166.39 gb
> > > 67 145.15 gb
> > > 68 162.55 gb
> > > 69 181.13 gb
> > > 70 181.18 gb
> > > 71 197.67 gb
> > > 72 164.79 gb
> > > 73 143.85 gb
> > > 74 169.17 gb
> > > 75 183.67 gb
> > > 76 143.16 gb
> > > 77 171.91 gb
> > > 78 167.75 gb
> > > 79 158.36 gb
> > > 80 198.83 gb
> > > 81 158.26 gb
> > > 82 182.52 gb
> > > 83 204.65 gb
> > > 84 179.78 gb
> > > 85 170.02 gb
> > > 86 185.70 gb
> > > 87 138.91 gb
> > > 88 190.66 gb
> > > 89 209.43 gb
> > > 90 193.54 gb
> > > 91 185.00 gb
> > > 92 170.31 gb
> > > 93 140.11 gb
> > > 94 161.69 gb
> > > 95 194.53 gb
> > > 96 184.35 gb
> > > 97 158.74 gb
> > > 98 184.39 gb
> > > 99 174.83 gb
> > > 100 183.30 gb
> > > 101 179.82 gb
> > > 102 160.84 gb
> > > 103 163.29 gb
> > > 104 131.92 gb
> > > 105 158.09 gb
> > >
> > >
> > >
> > >> Again, would love to hear something from the devs on this one.
> > >>
> > >> Christian
> > >> > On 5 January 2015 at 15:39, ivan babrou <ibobrik@xxxxxxxxx> wrote:
> > >> >
> > >> > >
> > >> > >
> > >> > > On 5 January 2015 at 14:20, Christian Balzer <chibi@xxxxxxx>
> > >> > > wrote:
> > >> > >
> > >> > >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
> > >> > >>
> > >> > >> > Hi!
> > >> > >> >
> > >> > >> > I have a cluster with 106 osds and disk usage varies from
> > >> > >> > 166gb to 316gb. Disk usage is highly correlated with the number
> > >> > >> > of pgs per osd (no surprise here). Is there a reason for ceph to
> > >> > >> > allocate more pgs on some nodes?
> > >> > >> >
> > >> > >> In essence what Wido said, you're a bit low on PGs.
> > >> > >>
> > >> > >> Also given your current utilization, pool 14 is totally
> > >> > >> oversized with 1024 PGs. You might want to re-create it with a
> > >> > >> smaller size, and double pool 0 to 512 PGs and pool 10 to 4096.
> > >> > >> I assume you did raise the PGPs as well when changing the PGs,
> > >> > >> right?
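> > >> > >>
> > >> > >> Something along these lines, with pgp_num following pg_num (pool
> > >> > >> names below are placeholders):
> > >> > >> ---
> > >> > >> ceph osd pool set <pool> pg_num 512
> > >> > >> ceph osd pool set <pool> pgp_num 512
> > >> > >> ---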
> > >> > >>
> > >> > >
> > >> > > Yep, pg = pgp for all pools. Pool 14 is just for testing
> > >> > > purposes, it might get large eventually.
> > >> > >
> > >> > > I followed your advice in doubling pools 0 and 10. It is
> > >> > > rebalancing at 30% degraded now, but so far the big osds are
> > >> > > getting bigger and the small ones smaller:
> > >> > > http://i.imgur.com/hJcX9Us.png.
> > >> > > I hope that trend will change before rebalancing is complete.
> > >> > >
> > >> > >
> > >> > >> And yeah, CEPH isn't particularly good at balancing stuff by
> > >> > >> itself, but with sufficient PGs you ought to get the variance
> > >> > >> below/around 30%.
> > >> > >>
> > >> > >
> > >> > > Is this going to change in the future releases?
> > >> > >
> > >> > >
> > >> > >> Christian
> > >> > >>
> > >> > >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the
> > >> > >> > smallest are 87, 33 and 55 (170gb each). The biggest pool has
> > >> > >> > 2048 pgs, pools with very little data have only 8 pgs. PG size
> > >> > >> > in the biggest pool is ~6gb (5.1..6.3 actually).
> > >> > >> >
> > >> > >> > Lack of balanced disk usage prevents me from using all the
> > >> > >> > disk space. When the biggest osd is full, the cluster does not
> > >> > >> > accept writes anymore.
> > >> > >> >
> > >> > >> > Here's gist with info about my cluster:
> > >> > >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
> > >> > >> >
> > >> > >>
> > >> > >>
> > >> > >> --
> > >> > >> Christian Balzer        Network/Systems Engineer
> > >> > >> chibi@xxxxxxx           Global OnLine Japan/Fusion
> > >> > >> Communications http://www.gol.com/
> > >> > >>
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Regards, Ian Babrou
> > >> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> > >> > >
> > >> >
> > >> >
> > >> >
> > >>
> > >>
> > >> --
> > >> Christian Balzer        Network/Systems Engineer
> > >> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > >> http://www.gol.com/
> > >>
> > >
> > >
> > >
> > > --
> > > Regards, Ian Babrou
> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> > >
> > 
> > 
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


