Re: Filling up ceph past 75%

Hello,

On Sun, 28 Aug 2016 21:23:41 -0500 Sean Sullivan wrote:

> I've seen it in the past on the ML but I don't remember seeing it lately.
> We recently had a Ceph engineer come out from RH and he mentioned he
> hasn't seen this kind of disparity either, which made me jump on here to
> double-check, as I thought it was a well-known thing.
> 
As I said, this looks extreme.
But w/o all OSD weights being equal and a sanity check of the CRUSH rules,
it may be "working as expected". 

> So I'm not crazy and the roughly 30% difference is normal? 
That's roughly what I'd expect to see from default clusters, yes.
On a 4 node, 24 OSD cluster w/o any weight changes (and all OSDs having the
same weight) I see between 76 and 106 PGs per OSD.
And this is RBD only, just one pool, so all these PGs are relevant when it
comes to space usage. 
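
For comparison, that per-OSD PG spread comes straight from the osdmap, the
same way your osdmaptool run below does it (the pool ID is a placeholder
here):

  ceph osd getmap -o /tmp/osdmap
  osdmaptool --test-map-pgs --pool <pool-id> /tmp/osdmap

The min/max lines at the bottom of that output are the interesting part.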

Not perfect, but a far cry from what you mentioned below.

You may benefit from more PGs in general (you seem to be well below 100
per OSD given the output of your osdmaptool run) and from having the "correct"
number of them assigned to your largest pool(s).
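
As a purely illustrative sketch (not a recommendation for your exact
numbers): the usual rule of thumb is ~100 PGs per OSD across all pools,
weighted by each pool's share of the data and rounded to a power of two.
With 630 OSDs and 3x replication that's roughly 630 * 100 / 3 = 21000 PGs
cluster-wide, nearly all of which would belong to .rgw.buckets. Compare
that against what you have and bump if needed:

  ceph osd pool get .rgw.buckets pg_num
  ceph osd pool set .rgw.buckets pg_num 16384
  ceph osd pool set .rgw.buckets pgp_num 16384

(16384 is just the nearest power of two for that example. Keep in mind
pg_num can't be decreased again and the pgp_num change will shuffle a lot
of data, so raise it in steps.)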

> I've tried the
> reweight-by-utilization function before (with other clusters) and have been
> left with broken PGs (ones that seem to be stuck backfilling) before, so
> I've stayed away from it.  
That of course shouldn't happen, unless you were approaching corner cases
(tunables and too few OSDs for CRUSH to make up its mind). 
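
If you do give it another go, the jewel version (or the backport) can do a
dry run first - the threshold and values below are just examples:

  ceph osd test-reweight-by-utilization 110
  ceph osd reweight-by-utilization 110

And the manual route I prefer is the persistent CRUSH weight, e.g. nudging
your fullest OSD from the listing below down a bit:

  ceph osd crush reweight osd.193 2.2

(2.2 is an arbitrary example against its current 2.48537; small steps and
let backfill settle in between.)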

Christian

> I saw that it has been redone, but given past
> exposure I've been hesitant. I'll give it another shot in a test instance
> and see how it goes.
> 
> Thanks for your help as always Mr. Balzer.
> 
> On Aug 28, 2016 8:59 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Sun, 28 Aug 2016 14:34:25 -0500 Sean Sullivan wrote:
> >
> > > I was curious if anyone has filled ceph storage beyond 75%.
> >
> > If you (re-)search the ML archives, you will find plenty of cases like
> > this, albeit most of them involuntary.
> > Same goes for uneven distribution.
> >
> > > Admittedly we
> > > lost a single host due to power failure and are down 1 host until the
> > > replacement parts arrive, but outside of that I am seeing disparity
> > > between the most and least full OSDs:
> > >
> > > ID  WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR
> > > MIN/MAX VAR: 0/1.26  STDDEV: 7.12
> > >                TOTAL 2178T 1625T  552T 74.63
> > >
> > > 559 4.54955  1.00000 3724G 2327G 1396G 62.50 0.84
> > > 193 2.48537  1.00000 3724G 3406G  317G 91.47 1.23
> > >
> > Those extremes, especially with the weights they have, look odd indeed.
> > Unless OSD 193 is in the rack which lost a node.
> >
> > > The crush weights are really off right now but even with a default crush
> > > map I am seeing a similar spread::
> > >
> > > # osdmaptool --test-map-pgs --pool 1 /tmp/osdmap
> > >  avg 82 stddev 10.54 (0.128537x) (expected 9.05095 0.110377x))
> > >  min osd.336 55
> > >  max osd.54 115
> > >
> > > That's with a default weight of 3.000 across all osds. I was wondering if
> > > anyone can give me any tips on how to reach closer to 80% full.
> > >
> > > We have 630 OSDs (down one host right now, but it will be back in within a
> > > week or so) spread across 3 racks of 7 hosts (30 OSDs each). Our data
> > > replication scheme is by rack and we only use S3 (so 98% of our data is
> > > in the .rgw.buckets pool). We are on hammer (0.94.7) and using the hammer
> > > tunables.
> > >
> > What comes to mind here is that your split into 3 buckets (racks)
> > and then into 7 (hosts) is probably not helping the already rather fuzzy
> > CRUSH algorithm to come up with an even distribution.
> > Meaning that imbalances are likely to be amplified.
> >
> > And dense (30 OSDs) storage servers amplify things of course when one goes
> > down.
> >
> > So how many PGs in the bucket pool then?
> >
> > With jewel (backport exists, check the ML archives) there's an improved
> > reweight-by-utilization script that can help with these things.
> > And I prefer to do this manually by using the (persistent) crush-reweight
> > to achieve a more even distribution.
> >
> > For example on one cluster here I got the 18 HDD OSDs all within 100GB of
> > each other.
> >
> > However having lost 3 of those OSDs 2 days ago the spread is now 300GB,
> > most likely NOT helped by the manual adjustments done earlier.
> > So your nice and evenly distributed cluster during normal state may be
> > worse off using custom weights when there is a significant OSD loss.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


