On Tue, 6 Jan 2015 19:28:44 +0400 ivan babrou wrote:

> Restarting OSD fixed PGs that were stuck: http://i.imgur.com/qd5vuzV.png
>
Good to hear that. Funny (not really) how often restarting OSDs fixes stuff like that.

> Still, OSD disk usage is very different, 150..250gb. Shall I double PGs again?
>
Not really; your settings are now, if anything, on the high side.

Looking at your graph and data, the current variance is clearly an improvement over the previous state, though far from ideal of course.

I had a Firefly cluster with non-optimal CRUSH tunables until 20 minutes ago. From the looks of it so far it will improve data placement, however it is a very involved process (lots of data movement) and on top of that all your clients need to support this. So let me get back to you tomorrow on whether it actually improved things massively. In the meantime you should read up at:
http://ceph.com/docs/master/rados/operations/crush-map/#tunables

In particular:
---
WHICH CLIENT VERSIONS SUPPORT CRUSH_TUNABLES3

v0.78 (firefly) or later
Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
---
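
For reference, the switch itself is only a few commands. This is a rough sketch rather than a verified procedure, and the profile name and backup path below are just examples; pick the profile from the page above and expect heavy data movement:

# ceph osd crush show-tunables               # see what the cluster currently uses
# ceph osd getcrushmap -o /tmp/crushmap.bak  # keep a copy; "ceph osd setcrushmap -i /tmp/crushmap.bak" rolls it back
# ceph osd crush tunables firefly            # or "optimal"; this is what triggers the rebalance
# ceph -w                                    # watch the data movement until the cluster is healthy again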

Regards,

Christian

> On 6 January 2015 at 17:12, ivan babrou <ibobrik@xxxxxxxxx> wrote:
>
> > I deleted some old backups and GC is returning some disk space back. But cluster state is still bad:
> >
> > 2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23 active+remapped+wait_backfill, 1 active+remapped+wait_backfill+backfill_toofull, 2 active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784 GB used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623 objects degraded (0.529%)
> >
> > Here's what disk utilization across OSDs looks like: http://i.imgur.com/RWk9rvW.png
> >
> > Still, one OSD is super-huge. I don't understand why one PG is toofull if the biggest OSD went from 348gb to 294gb.
> >
> > root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
> > dumped all in format plain
> > 10.f26 1018 0 1811 0 2321324247 3261 3261 active+remapped+wait_backfill+backfill_toofull 2015-01-05 15:06:49.504731 22897'359132 22897:48571 [91,1] 91 [8,40] 8 19248'358872 2015-01-05 11:58:03.062029 18326'358786 2014-12-31 23:43:02.285043
> >
> > On 6 January 2015 at 03:40, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >> On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
> >>
> >> > Rebalancing is almost finished, but things got even worse:
> >> > http://i.imgur.com/0HOPZil.png
> >> >
> >> Looking at that graph, only one OSD really kept growing and growing; everything else seems a lot denser, less varied than before, as one would have expected.
> >>
> >> Since I don't think you mentioned it before, what version of Ceph are you using and how are your CRUSH tunables set?
> >>
> >
> > I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at all.
> >
> >> > Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull state:
> >> >
> >> > 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23 active+remapped+wait_backfill, 1 active+remapped+wait_backfill+backfill_toofull, 2 active+remapped+backfilling, 5805 active+clean, 1 active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360 GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
> >> >
> >> > So at 55.8% disk space utilization ceph is full. That doesn't look good.
> >> >
> >> Indeed it doesn't.
> >>
> >> At this point you might want to manually lower the weight of that OSD (probably have to change the osd_backfill_full_ratio first to let it settle).
> >>
> >
> > I'm sure that's what ceph should do, not me.
> >
> >> Thanks to Robert for bringing up that blueprint for Hammer; let's hope it makes it in and gets backported.
> >>
> >> I sure hope somebody from the Ceph team will pipe up, but here's what I think is happening:
> >> You're using radosgw and I suppose many files are so similarly named that they wind up clumping on the same PGs (OSDs).
> >>
> >
> > Nope, you are wrong here. PGs are roughly the same size; I mentioned that in my first email. Now the biggest osd has 95 PGs and the smallest one has 59 (I only counted PGs from the biggest pool).
> >
> >> Now what I would _think_ could help with that is striping.
> >>
> >> However radosgw doesn't support the full striping options that RBD does.
> >>
> >> The only thing you can modify is the stripe (object) size, which defaults to 4MB. And I bet most of your RGW files are less than that in size, meaning they wind up on just one PG.
> >>
> >
> > Wrong again: I use that cluster for elasticsearch backups and docker images. That stuff is usually much bigger than 4mb.
> >
> > Weird thing: I calculated osd sizes from "ceph pg dump" and they look different from what is actually on disk. The biggest OSD is 213gb and the smallest is 131gb. GC isn't finished yet, but that seems very different from the current usage.
> >
> > # ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' | sed 's/[][,]/ /g' > pgs.txt
> > # cat pgs.txt | awk '{ sizes[$3] += $2; sizes[$4] += $2; } END { for (o in sizes) { printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024; } }' | sort -n
> >
> > 0 198.18 gb
> > 1 188.74 gb
> > 2 165.94 gb
> > 3 143.28 gb
> > 4 193.37 gb
> > 5 185.87 gb
> > 6 146.46 gb
> > 7 170.67 gb
> > 8 213.93 gb
> > 9 200.22 gb
> > 10 144.05 gb
> > 11 164.44 gb
> > 12 158.27 gb
> > 13 204.96 gb
> > 14 190.04 gb
> > 15 158.48 gb
> > 16 172.86 gb
> > 17 157.05 gb
> > 18 179.82 gb
> > 19 175.86 gb
> > 20 192.63 gb
> > 21 179.82 gb
> > 22 181.30 gb
> > 23 172.97 gb
> > 24 141.21 gb
> > 25 165.63 gb
> > 26 139.87 gb
> > 27 184.18 gb
> > 28 160.75 gb
> > 29 185.88 gb
> > 30 186.13 gb
> > 31 163.38 gb
> > 32 182.92 gb
> > 33 134.82 gb
> > 34 186.56 gb
> > 35 166.91 gb
> > 36 163.49 gb
> > 37 205.59 gb
> > 38 199.26 gb
> > 39 151.43 gb
> > 40 173.23 gb
> > 41 200.54 gb
> > 42 198.07 gb
> > 43 150.48 gb
> > 44 165.54 gb
> > 45 193.87 gb
> > 46 177.05 gb
> > 47 167.97 gb
> > 48 186.68 gb
> > 49 177.68 gb
> > 50 204.94 gb
> > 51 184.52 gb
> > 52 160.11 gb
> > 53 163.33 gb
> > 54 137.28 gb
> > 55 168.97 gb
> > 56 193.08 gb
> > 57 176.87 gb
> > 58 166.36 gb
> > 59 171.98 gb
> > 60 175.50 gb
> > 61 199.39 gb
> > 62 175.31 gb
> > 63 164.54 gb
> > 64 171.26 gb
> > 65 154.86 gb
> > 66 166.39 gb
> > 67 145.15 gb
> > 68 162.55 gb
> > 69 181.13 gb
> > 70 181.18 gb
> > 71 197.67 gb
> > 72 164.79 gb
> > 73 143.85 gb
> > 74 169.17 gb
> > 75 183.67 gb
> > 76 143.16 gb
> > 77 171.91 gb
> > 78 167.75 gb
> > 79 158.36 gb
> > 80 198.83 gb
> > 81 158.26 gb
> > 82 182.52 gb
> > 83 204.65 gb
> > 84 179.78 gb
> > 85 170.02 gb
> > 86 185.70 gb
> > 87 138.91 gb
> > 88 190.66 gb
> > 89 209.43 gb
> > 90 193.54 gb
> > 91 185.00 gb
> > 92 170.31 gb
> > 93 140.11 gb
> > 94 161.69 gb
> > 95 194.53 gb
> > 96 184.35 gb
> > 97 158.74 gb
> > 98 184.39 gb
> > 99 174.83 gb
> > 100 183.30 gb
> > 101 179.82 gb
> > 102 160.84 gb
> > 103 163.29 gb
> > 104 131.92 gb
> > 105 158.09 gb
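
A side note on the numbers above: the same pgs.txt can also give you the PG count per OSD next to the calculated size, which makes the size/count correlation easy to check. A quick sketch reusing that pipeline (it assumes the same two-copy layout, i.e. exactly two OSDs listed per PG in the dump):

# cat pgs.txt | awk '{ pgs[$3]++; pgs[$4]++; sizes[$3] += $2; sizes[$4] += $2 } END { for (o in pgs) { printf "%d %d pgs %.2f gb\n", o, pgs[o], sizes[o] / 1024 / 1024 / 1024 } }' | sort -n

If the heaviest OSDs consistently show both the most PGs and the most data, that points at plain PG placement rather than a few oversized PGs.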
> >
> >> Again, would love to hear something from the devs on this one.
> >>
> >> Christian
> >>
> >> > On 5 January 2015 at 15:39, ivan babrou <ibobrik@xxxxxxxxx> wrote:
> >> >
> >> > > On 5 January 2015 at 14:20, Christian Balzer <chibi@xxxxxxx> wrote:
> >> > >
> >> > >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
> >> > >>
> >> > >> > Hi!
> >> > >> >
> >> > >> > I have a cluster with 106 osds and disk usage varies from 166gb to 316gb. Disk usage is highly correlated with the number of PGs per OSD (no surprise here). Is there a reason for ceph to allocate more PGs on some nodes?
> >> > >> >
> >> > >> In essence what Wido said, you're a bit low on PGs.
> >> > >>
> >> > >> Also, given your current utilization, pool 14 is totally oversized with 1024 PGs. You might want to re-create it with a smaller size and double pool 0 to 512 PGs and 10 to 4096.
> >> > >> I assume you did raise the PGPs as well when changing the PGs, right?
> >> > >>
> >> > >
> >> > > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes, it might get large eventually.
> >> > >
> >> > > I followed your advice in doubling pools 0 and 10. It is rebalancing at 30% degraded now, but so far big osds are getting bigger and small ones smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend changes before rebalancing is complete.
> >> > >
> >> > >> And yeah, CEPH isn't particularly good at balancing stuff by itself, but with sufficient PGs you ought to get the variance below/around 30%.
> >> > >>
> >> > >
> >> > > Is this going to change in future releases?
> >> > >
> >> > >> Christian
> >> > >>
> >> > >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools with very little data have only 8 pgs. PG size in the biggest pool is ~6gb (5.1..6.3 actually).
> >> > >> >
> >> > >> > Lack of balanced disk usage prevents me from using all the disk space. When the biggest osd is full, the cluster does not accept writes anymore.
> >> > >> >
> >> > >> > Here's a gist with info about my cluster:
> >> > >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
> >> > >> >
> >> > >>
> >> > >> --
> >> > >> Christian Balzer        Network/Systems Engineer
> >> > >> chibi@xxxxxxx   Global OnLine Japan/Fusion Communications
> >> > >> http://www.gol.com/
> >> > >>
> >> > >
> >> > > --
> >> > > Regards, Ian Babrou
> >> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> >> > >
> >> >
> >>
> >> --
> >> Christian Balzer        Network/Systems Engineer
> >> chibi@xxxxxxx   Global OnLine Japan/Fusion Communications
> >> http://www.gol.com/
> >
> > --
> > Regards, Ian Babrou
> > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx   Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com