Re: Different disk usage on different OSDs

ivan babrou <ibobrik@xxxxxxxxx> · Tue, 6 Jan 2015 18:12:25 +0400

I deleted some old backups and GC is reclaiming disk space, but the cluster state is still bad:
2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23 active+remapped+wait_backfill, 1 active+remapped+wait_backfill+backfill_toofull, 2 active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784 GB used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623 objects degraded (0.529%)

Here's what disk utilization across OSDs looks like: http://i.imgur.com/RWk9rvW.png

One OSD is still super-huge. I don't understand why one PG is toofull when the biggest OSD went from 348gb down to 294gb.

root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
dumped all in format plain
10.f26	1018	0	1811	0	2321324247	3261	3261	active+remapped+wait_backfill+backfill_toofull	2015-01-05 15:06:49.504731	22897'359132	22897:48571	[91,1]	91	[8,40]	8	19248'358872	2015-01-05 11:58:03.062029	18326'358786	2014-12-31 23:43:02.285043
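
A rough sanity check on the numbers above, as a sketch only. As far as I know, backfill_toofull is decided per backfill target OSD against osd_backfill_full_ratio (0.85 by default on 0.80.x), not against cluster-wide usage. The per-OSD size used below is an assumption: it is simply the 46906 GB total divided evenly over 106 OSDs, and the real disks may differ. The helper name pct is made up for illustration.

```shell
# Percent used, given "used" and "total" in the same unit.
pct() { awk -v u="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * u / t }'; }

# Assumption: all 106 OSDs are the same size, so average = total / count.
avg_osd_gb=$(awk 'BEGIN { printf "%.1f", 46906 / 106 }')

pct 22784 46906          # cluster-wide usage from the pgmap line
pct 348 "$avg_osd_gb"    # biggest OSD before the cleanup
pct 294 "$avg_osd_gb"    # biggest OSD now
```

Under that assumption this prints 48.6, 78.6 and 66.4, all below 85%. So the OSD that stands out in the graph may not be the one tripping the flag. The backfill targets of 10.f26 (presumably 91 and 1, the up set in the dump line above) would be the ones to check directly.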

On 6 January 2015 at 03:40, Christian Balzer <chibi@xxxxxxx> wrote:
On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:

> Rebalancing is almost finished, but things got even worse:
> http://i.imgur.com/0HOPZil.png
>

Looking at that graph only one OSD really kept growing and growing,
everything else seems to be a lot denser, less varied than before, as one
would have expected.

Since I don't think you mentioned it before, what version of Ceph are you
using and how are your CRUSH tunables set?

I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at all.

> Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull
> state:
>
> 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23
> active+remapped+wait_backfill, 1
> active+remapped+wait_backfill+backfill_toofull, 2
> active+remapped+backfilling, 5805 active+clean, 1
> active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360
> GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
>
> So at 55.8% disk space utilization ceph is full. That doesn't look very
> good.
>

Indeed it doesn't.

At this point you might want to manually lower the weight of that OSD
(probably have to change the osd_backfill_full_ratio first to let it
settle).

I'm sure that's what ceph should do, not me.
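
For reference, the manual steps suggested above would look roughly like this. It is a sketch, not a recommendation: OSD_ID, RATIO and WEIGHT are hypothetical values, 0.85 is what I believe the default backfill ratio to be, and the run wrapper is only an illustration helper that prints each command and executes it only when DRY_RUN=0 is set.

```shell
# Hypothetical values: pick the OSD and numbers for your own cluster.
OSD_ID=30        # assumption: one of the fullest OSDs named earlier in the thread
RATIO=0.90       # temporary backfill threshold (default believed to be 0.85)
WEIGHT=0.85      # reweight override, between 0 and 1

# Print each command; actually run it only when DRY_RUN=0.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-1}" = 0 ] || return 0
  "$@"
}

# Let the stuck backfill proceed, then push data away from the full OSD.
run ceph tell 'osd.*' injectargs "--osd_backfill_full_ratio $RATIO"
run ceph osd reweight "$OSD_ID" "$WEIGHT"
```

Values set through injectargs are not kept across an OSD restart, so the ratio would fall back to whatever ceph.conf says. "ceph osd reweight-by-utilization" is the closest thing to ceph doing this on its own, but it still has to be run by hand.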

Thanks to Robert for bringing up that blueprint for Hammer, let's
hope it makes it in and gets backported.

I sure hope somebody from the Ceph team will pipe up, but here's what I
think is happening:

You're using radosgw and I suppose many files are so similarly named that
they wind up clumping on the same PGs (OSDs).

Nope, you are wrong here. PGs have roughly the same size, I mentioned that in my first email. Now the biggest osd has 95 PGs and the smallest one has 59 (I only counted PGs from the biggest pool).
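
One way to reproduce that per-OSD PG count, as a sketch: the function below counts how many PGs of one pool land on each OSD, reading plain "ceph pg dump" output on stdin. The name pgs_per_osd is made up for illustration, pool 10 is assumed to be the biggest pool, and field 14 is assumed to be the up set in the 0.80.x plain format (field 16 would be the acting set).

```shell
# Count PGs per OSD for one pool from plain `ceph pg dump` lines on stdin.
# $1 = pool id, $2 = awk field holding the OSD set (14 = up, 16 = acting;
# both field numbers are assumptions about the 0.80.x plain layout).
pgs_per_osd() {
  awk -v pool="$1" -v col="${2:-14}" '
    index($1, pool ".") == 1 {         # PG ids look like "<pool>.<hex>"
      set = $col
      gsub(/[][]/, "", set)            # strip the surrounding brackets
      n = split(set, osds, ",")        # works for any replica count
      for (i = 1; i <= n; i++) count[osds[i]]++
    }
    END { for (o in count) print count[o], "osd." o }
  ' | sort -n
}

# Usage on a live cluster (smallest and largest count):
#   ceph pg dump | pgs_per_osd 10 | sed -n '1p;$p'
```

If pool 10 really has 4096 PGs with 2 replicas, the average would be 8192 / 106, about 77 PGs per OSD, so 59 and 95 are roughly 24% below and 23% above it.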

Now what I would _think_ could help with that is striping.
However radosgw doesn't support the full striping options as RBD does.
The only thing you can modify is stripe (object) size, which defaults to
4MB. And I bet most of your RGW files are less than that in size, meaning
they wind up on just one PG.

Wrong again, I use that cluster for elasticsearch backups and docker images. That stuff is usually much bigger than 4mb.

Weird thing: I calculated OSD sizes from "ceph pg dump" and they look different from the actual disk usage. By that calculation the biggest OSD is 213gb and the smallest is 131gb. GC isn't finished yet, but that is still very different from what the disks report.

# ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' | sed 's/[][,]/ /g' > pgs.txt
# cat pgs.txt | awk '{ sizes[$3] += $2; sizes[$4] += $2; } END { for (o in sizes) { printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024; } }' | sort -n

0 198.18 gb
1 188.74 gb
2 165.94 gb
3 143.28 gb
4 193.37 gb
5 185.87 gb
6 146.46 gb
7 170.67 gb
8 213.93 gb
9 200.22 gb
10 144.05 gb
11 164.44 gb
12 158.27 gb
13 204.96 gb
14 190.04 gb
15 158.48 gb
16 172.86 gb
17 157.05 gb
18 179.82 gb
19 175.86 gb
20 192.63 gb
21 179.82 gb
22 181.30 gb
23 172.97 gb
24 141.21 gb
25 165.63 gb
26 139.87 gb
27 184.18 gb
28 160.75 gb
29 185.88 gb
30 186.13 gb
31 163.38 gb
32 182.92 gb
33 134.82 gb
34 186.56 gb
35 166.91 gb
36 163.49 gb
37 205.59 gb
38 199.26 gb
39 151.43 gb
40 173.23 gb
41 200.54 gb
42 198.07 gb
43 150.48 gb
44 165.54 gb
45 193.87 gb
46 177.05 gb
47 167.97 gb
48 186.68 gb
49 177.68 gb
50 204.94 gb
51 184.52 gb
52 160.11 gb
53 163.33 gb
54 137.28 gb
55 168.97 gb
56 193.08 gb
57 176.87 gb
58 166.36 gb
59 171.98 gb
60 175.50 gb
61 199.39 gb
62 175.31 gb
63 164.54 gb
64 171.26 gb
65 154.86 gb
66 166.39 gb
67 145.15 gb
68 162.55 gb
69 181.13 gb
70 181.18 gb
71 197.67 gb
72 164.79 gb
73 143.85 gb
74 169.17 gb
75 183.67 gb
76 143.16 gb
77 171.91 gb
78 167.75 gb
79 158.36 gb
80 198.83 gb
81 158.26 gb
82 182.52 gb
83 204.65 gb
84 179.78 gb
85 170.02 gb
86 185.70 gb
87 138.91 gb
88 190.66 gb
89 209.43 gb
90 193.54 gb
91 185.00 gb
92 170.31 gb
93 140.11 gb
94 161.69 gb
95 194.53 gb
96 184.35 gb
97 158.74 gb
98 184.39 gb
99 174.83 gb
100 183.30 gb
101 179.82 gb
102 160.84 gb
103 163.29 gb
104 131.92 gb
105 158.09 gb
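
A possible reason for part of the gap, offered as a guess: the pipeline above sums field 14, which in the 0.80.x plain output appears to be the up set, while remapped PGs still hold their data on the acting set (field 16). That only affects the handful of remapped PGs, so it cannot explain the whole difference. The pipeline also hardcodes two replicas, and the bytes column is the PG's logical data size, so on-disk overhead would not show up in it. A variant that takes the field as a parameter and handles any replica count might look like this (osd_sizes is a made-up name):

```shell
# Sum PG bytes per OSD from plain `ceph pg dump` lines on stdin.
# $1 = awk field holding the OSD set: 14 = up set (as in the one-liner
# above), 16 = acting set. Field numbers assume the 0.80.x plain layout.
osd_sizes() {
  awk -v col="${1:-16}" '
    $1 ~ /^[0-9]+\.[0-9a-f]+$/ {       # only PG rows, e.g. "10.f26"
      set = $col
      gsub(/[][]/, "", set)            # strip the surrounding brackets
      n = split(set, osds, ",")        # works for any replica count
      for (i = 1; i <= n; i++) sizes[osds[i]] += $6
    }
    END {
      for (o in sizes) printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024
    }
  ' | sort -n
}

# Usage on a live cluster:
#   ceph pg dump | osd_sizes 16
```

For the single 10.f26 row shown earlier, field 14 attributes its 2.16 gb to OSDs 1 and 91, while field 16 attributes it to OSDs 8 and 40.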

Again, would love to hear something from the devs on this one.

Christian

> On 5 January 2015 at 15:39, ivan babrou <ibobrik@xxxxxxxxx> wrote:
>
> > On 5 January 2015 at 14:20, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
> >>
> >> > Hi!
> >> >
> >> > I have a cluster with 106 osds and disk usage is varying from 166gb
> >> > to 316gb. Disk usage is highly correlated to number of pg per osd
> >> > (no surprise here). Is there a reason for ceph to allocate more pg
> >> > on some nodes?
> >> >
> >> In essence what Wido said, you're a bit low on PGs.
> >>
> >> Also given your current utilization, pool 14 is totally oversize with
> >> 1024 PGs. You might want to re-create it with a smaller size and
> >> double pool 0 to 512 PGs and 10 to 4096.
> >> I assume you did raise the PGPs as well when changing the PGs, right?
> >>
> >
> > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes, it
> > might get large eventually.
> >
> > I followed your advice in doubling pools 0 and 10. It is rebalancing at
> > 30% degraded now, but so far big osds become bigger and small become
> > smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend will
> > change before rebalancing is complete.
> >
> >> And yeah, CEPH isn't particularly good at balancing stuff by itself, but
> >> with sufficient PGs you ought to get the variance below/around 30%.
> >>
> >
> > Is this going to change in the future releases?
> >
> >> Christian
> >>
> >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest
> >> > are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools
> >> > with very little data have only 8 pgs. PG size in biggest pool is
> >> > ~6gb (5.1..6.3 actually).
> >> >
> >> > Lack of balanced disk usage prevents me from using all the disk
> >> > space. When the biggest osd is full, cluster does not accept writes
> >> > anymore.
> >> >
> >> > Here's gist with info about my cluster:
> >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
> >> >
> >>
> >> --
> >> Christian Balzer        Network/Systems Engineer
> >> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> >> http://www.gol.com/
> >>
> >
> > --
> > Regards, Ian Babrou
> > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> >
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

-- 
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com