Restarting the OSDs fixed the PGs that were stuck: http://i.imgur.com/qd5vuzV.png
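(For the archives, a rough sketch of how to spot them, commands from memory: <pgid> is a placeholder.

# ceph pg dump_stuck unclean
# ceph pg <pgid> query

That lists the stuck PGs and their acting sets; restarting the OSDs in the acting set is what got them moving here.)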
Still, OSD disk usage is very different, 150..250gb. Shall I double PGs again?
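If so, I assume it's the usual two-step bump on the big pool, roughly like this (pool name and target count are placeholders; as I understand it pgp_num has to follow pg_num or the data won't actually move):

# ceph osd pool set <pool-name> pg_num <new-count>
# ceph osd pool set <pool-name> pgp_num <new-count>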
On 6 January 2015 at 17:12, ivan babrou <ibobrik@xxxxxxxxx> wrote:
I deleted some old backups and GC is returning some disk space. But cluster state is still bad:

2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23 active+remapped+wait_backfill, 1 active+remapped+wait_backfill+backfill_toofull, 2 active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784 GB used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623 objects degraded (0.529%)

Here's what disk utilization across OSDs looks like: http://i.imgur.com/RWk9rvW.png

Still one OSD is super-huge. I don't understand why one PG is toofull if the biggest OSD moved from 348gb to 294gb.

root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
dumped all in format plain
10.f26 1018 0 1811 0 2321324247 3261 3261 active+remapped+wait_backfill+backfill_toofull 2015-01-05 15:06:49.504731 22897'359132 22897:48571 [91,1] 91 [8,40] 8 19248'358872 2015-01-05 11:58:03.062029 18326'358786 2014-12-31 23:43:02.285043

On 6 January 2015 at 03:40, Christian Balzer <chibi@xxxxxxx> wrote:

On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
> Rebalancing is almost finished, but things got even worse:
> http://i.imgur.com/0HOPZil.png
>
Looking at that graph only one OSD really kept growing and growing,
everything else seems to be a lot denser, less varied than before, as one
would have expected.
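(If you want to cross-check that without graphs, the OSD stat section of a pg dump shows per-OSD usage straight from the cluster; roughly, and I'm going from memory on the exact output columns:

# ceph pg dump osds

prints one line per OSD with its kb used and kb available.)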
Since I don't think you mentioned it before, what version of Ceph are you
using and how are your CRUSH tunables set?

I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at all.

> Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull
> state:
>
> 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23
> active+remapped+wait_backfill, 1
> active+remapped+wait_backfill+backfill_toofull, 2
> active+remapped+backfilling, 5805 active+clean, 1
> active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360
> GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
>
> So at 55.8% disk space utilization ceph is full. That doesn't look very
> well.
>
Indeed it doesn't.
At this point you might want to manually lower the weight of that OSD
(probably have to change the osd_backfill_full_ratio first to let it
settle).

I'm sure that's what ceph should do, not me.

Thanks to Robert for bringing up that blueprint for Hammer, let's
hope it makes it in and gets backported.
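In the meantime, spelled out, the manual workaround I meant above is something along these lines (the ratio and weight are only example values, and <osd-id> is whichever OSD is closest to full):

# ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'
# ceph osd reweight <osd-id> 0.90

The first lets the pending backfills proceed on the nearly full OSD; the second moves some PGs off it. Watch ceph -w and let it settle before touching anything else.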
I sure hope somebody from the Ceph team will pipe up, but here's what I
think is happening:
You're using radosgw and I suppose many files are so similarly named that
they wind up clumping on the same PGs (OSDs).

Nope, you are wrong here. PGs have roughly the same size, I mentioned that in my first email. Now the biggest osd has 95 PGs and the smallest one has 59 (I only counted PGs from the biggest pool).

Now what I would _think_ could help with that is striping.
However radosgw doesn't support the full striping options as RBD does.
The only thing you can modify is the stripe (object) size, which defaults to
4MB. And I bet most of your RGW files are less than that in size, meaning
they wind up on just one PG.

Wrong again, I use that cluster for elasticsearch backups and docker images. That stuff is usually much bigger than 4mb.

Weird thing: I calculated OSD sizes from "ceph pg dump" and they look different from actual usage. The biggest OSD is 213gb and the smallest is 131gb. GC isn't finished yet, but that seems very different from what the OSDs currently report.

# ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' | sed 's/[][,]/ /g' > pgs.txt
# cat pgs.txt | awk '{ sizes[$3] += $2; sizes[$4] += $2; } END { for (o in sizes) { printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024; } }' | sort -n
0 198.18 gb
1 188.74 gb
2 165.94 gb
3 143.28 gb
4 193.37 gb
5 185.87 gb
6 146.46 gb
7 170.67 gb
8 213.93 gb
9 200.22 gb
10 144.05 gb
11 164.44 gb
12 158.27 gb
13 204.96 gb
14 190.04 gb
15 158.48 gb
16 172.86 gb
17 157.05 gb
18 179.82 gb
19 175.86 gb
20 192.63 gb
21 179.82 gb
22 181.30 gb
23 172.97 gb
24 141.21 gb
25 165.63 gb
26 139.87 gb
27 184.18 gb
28 160.75 gb
29 185.88 gb
30 186.13 gb
31 163.38 gb
32 182.92 gb
33 134.82 gb
34 186.56 gb
35 166.91 gb
36 163.49 gb
37 205.59 gb
38 199.26 gb
39 151.43 gb
40 173.23 gb
41 200.54 gb
42 198.07 gb
43 150.48 gb
44 165.54 gb
45 193.87 gb
46 177.05 gb
47 167.97 gb
48 186.68 gb
49 177.68 gb
50 204.94 gb
51 184.52 gb
52 160.11 gb
53 163.33 gb
54 137.28 gb
55 168.97 gb
56 193.08 gb
57 176.87 gb
58 166.36 gb
59 171.98 gb
60 175.50 gb
61 199.39 gb
62 175.31 gb
63 164.54 gb
64 171.26 gb
65 154.86 gb
66 166.39 gb
67 145.15 gb
68 162.55 gb
69 181.13 gb
70 181.18 gb
71 197.67 gb
72 164.79 gb
73 143.85 gb
74 169.17 gb
75 183.67 gb
76 143.16 gb
77 171.91 gb
78 167.75 gb
79 158.36 gb
80 198.83 gb
81 158.26 gb
82 182.52 gb
83 204.65 gb
84 179.78 gb
85 170.02 gb
86 185.70 gb
87 138.91 gb
88 190.66 gb
89 209.43 gb
90 193.54 gb
91 185.00 gb
92 170.31 gb
93 140.11 gb
94 161.69 gb
95 194.53 gb
96 184.35 gb
97 158.74 gb
98 184.39 gb
99 174.83 gb
100 183.30 gb
101 179.82 gb
102 160.84 gb
103 163.29 gb
104 131.92 gb
105 158.09 gb

Again, would love to hear something from the devs on this one.
Christian
> On 5 January 2015 at 15:39, ivan babrou <ibobrik@xxxxxxxxx> wrote:
>
> >
> >
> > On 5 January 2015 at 14:20, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
> >>
> >> > Hi!
> >> >
> >> > I have a cluster with 106 osds and disk usage is varying from 166gb
> >> > to 316gb. Disk usage is highly correlated to number of pg per osd
> >> > (no surprise here). Is there a reason for ceph to allocate more pg
> >> > on some nodes?
> >> >
> >> In essence what Wido said, you're a bit low on PGs.
> >>
> >> Also given your current utilization, pool 14 is totally oversize with
> >> 1024 PGs. You might want to re-create it with a smaller size and
> >> double pool 0 to 512 PGs and 10 to 4096.
> >> I assume you did raise the PGPs as well when changing the PGs, right?
> >>
> >
> > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes, it
> > might get large eventually.
> >
> > I followed your advice in doubling pools 0 and 10. It is rebalancing at
> > 30% degraded now, but so far big osds become bigger and small become
> > smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend would
> > change before rebalancing is complete.
> >
> >
> >> And yeah, CEPH isn't particularly good at balancing stuff by itself, but
> >> with sufficient PGs you ought to get the variance below/around 30%.
> >>
> >
> > Is this going to change in the future releases?
> >
> >
> >> Christian
> >>
> >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest
> >> > are
> >> 87,
> >> > 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools with
> >> > very little data has only 8 pgs. PG size in biggest pool is ~6gb
> >> > (5.1..6.3 actually).
> >> >
> >> > Lack of balanced disk usage prevents me from using all the disk
> >> > space. When the biggest osd is full, cluster does not accept writes
> >> > anymore.
> >> >
> >> > Here's gist with info about my cluster:
> >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
> >> >
> >>
> >>
> >> --
> >> Christian Balzer Network/Systems Engineer
> >> chibi@xxxxxxx Global OnLine Japan/Fusion Communications
> >> http://www.gol.com/
> >>
> >
> >
> >
> > --
> > Regards, Ian Babrou
> > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> >
>
>
>
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Fusion Communications
http://www.gol.com/