Re: Different disk usage on different OSDs

On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:

> Rebalancing is almost finished, but things got even worse:
> http://i.imgur.com/0HOPZil.png
>
Looking at that graph, only one OSD really kept growing and growing;
everything else seems a lot denser and less varied than before, as one
would have expected.
 
Since I don't think you mentioned it before, what version of Ceph are you
using and how are your CRUSH tunables set?
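
Off the top of my head, something like the following should show both
(command names from memory, so double-check against your release):

  # Ceph version on this node
  ceph --version
  # CRUSH tunables as the monitors currently see them
  ceph osd crush show-tunables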

> Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull
> state:
> 
> 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23
> active+remapped+wait_backfill, 1
> active+remapped+wait_backfill+backfill_toofull, 2
> active+remapped+backfilling, 5805 active+clean, 1
> active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360
> GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
> 
> So at 55.8% disk space utilization ceph is full. That doesn't look very
> good.
> 
Indeed it doesn't.

At this point you might want to manually lower the weight of that OSD
(you'll probably have to raise the osd_backfill_full_ratio first to let
it settle).
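
As a rough sketch only (the OSD id and the ratios below are just
placeholders, adjust them to your cluster):

  # temporarily allow backfill to proceed past the current full ratio
  ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'
  # then lower the reweight value of the overly full OSD a bit
  ceph osd reweight 30 0.9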

Thanks to Robert for bringing up that blueprint for Hammer; let's hope
it makes it in and gets backported.

I sure hope somebody from the Ceph team will pipe up, but here's what I
think is happening:
You're using radosgw and I suppose many files are named so similarly
that they wind up clumping on the same PGs (OSDs).

Now what I would _think_ could help with that is striping.

However, radosgw doesn't support the full set of striping options that
RBD does.

The only thing you can modify is the stripe (object) size, which
defaults to 4MB. And I bet most of your RGW files are smaller than that,
meaning each of them winds up on just one PG.
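
If you want to experiment with that, the knob should be "rgw obj stripe
size" in ceph.conf (option name from memory, and the section name below
is just an example):

  [client.radosgw.gateway]
      # default is 4MB; a smaller value spreads large objects over more PGs
      rgw obj stripe size = 1048576

That only helps for objects bigger than the stripe size, of course.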

Again, would love to hear something from the devs on this one.

Christian
> On 5 January 2015 at 15:39, ivan babrou <ibobrik@xxxxxxxxx> wrote:
> 
> >
> >
> > On 5 January 2015 at 14:20, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
> >>
> >> > Hi!
> >> >
> >> > I have a cluster with 106 osds and disk usage is varying from 166gb
> >> > to 316gb. Disk usage is highly correlated to number of pg per osd
> >> > (no surprise here). Is there a reason for ceph to allocate more pg
> >> > on some nodes?
> >> >
> >> In essence what Wido said, you're a bit low on PGs.
> >>
> >> Also given your current utilization, pool 14 is totally oversized with
> >> 1024 PGs. You might want to re-create it with a smaller size and
> >> double pool 0 to 512 PGs and 10 to 4096.
> >> I assume you did raise the PGPs as well when changing the PGs, right?
> >>
> >
> > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes, it
> > might get large eventually.
> >
> > I followed your advice in doubling pools 0 and 10. It is rebalancing at
> > 30% degraded now, but so far the big osds are getting bigger and the
> > small ones smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend
> > will change before rebalancing is complete.
> >
> >
> >> And yeah, CEPH isn't particularly good at balancing stuff by itself,
> >> but with sufficient PGs you ought to get the variance below/around 30%.
> >>
> >
> > Is this going to change in the future releases?
> >
> >
> >> Christian
> >>
> >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest
> >> > are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools
> >> > with very little data have only 8 pgs. PG size in the biggest pool is
> >> > ~6gb (5.1..6.3 actually).
> >> >
> >> > Lack of balanced disk usage prevents me from using all the disk
> >> > space. When the biggest osd is full, the cluster does not accept writes
> >> > anymore.
> >> >
> >> > Here's gist with info about my cluster:
> >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
> >> >
> >>
> >>
> >> --
> >> Christian Balzer        Network/Systems Engineer
> >> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> >> http://www.gol.com/
> >>
> >
> >
> >
> > --
> > Regards, Ian Babrou
> > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> >
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


