I've seen it in the past on the ML, but I don't remember seeing it lately. We recently had a Ceph engineer come out from RH, and he mentioned he hasn't seen this kind of disparity either, which made me jump on here to double-check, as I thought it was a well-known thing.
So I'm not crazy, and the roughly 30% difference is normal? I've tried the reweight-by-utilization function before (on other clusters) and was left with broken PGs (ones that seemed to be stuck backfilling), so I've stayed away from it. I saw that it has been redone, but with that past exposure I've been hesitant. I'll give it another shot in a test instance and see how it goes.
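If I do re-test it, my rough plan (assuming the hammer command names I
remember are still current) is to keep an eye out for stuck PGs while
it does its thing, with something like:

# ceph health detail
# ceph pg dump_stuck unclean

Treat the exact stuck-state argument as a sketch; the accepted states
vary a bit between releases.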
Thanks for your help as always, Mr. Balzer.
On Aug 28, 2016 8:59 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:
Hello,
On Sun, 28 Aug 2016 14:34:25 -0500 Sean Sullivan wrote:
> I was curious if anyone has filled ceph storage beyond 75%.
If you (re-)search the ML archives, you will find plenty of cases like
this, albeit most of them involuntary.
Same goes for uneven distribution.
> Admittedly we
> lost a single host due to power failure and are down 1 host until the
> replacement parts arrive, but outside of that I am seeing disparity between
> the most and least full osds::
>
> ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
> MIN/MAX VAR: 0/1.26 STDDEV: 7.12
> TOTAL 2178T 1625T 552T 74.63
>
> 559 4.54955 1.00000 3724G 2327G 1396G 62.50 0.84
> 193 2.48537 1.00000 3724G 3406G 317G 91.47 1.23
>
Those extremes, especially with the weights they have, look odd indeed.
Unless OSD 193 is in the rack which lost a node.
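If you want to confirm where it sits, something like the following
should show which host/rack bucket osd.193 lives under (command names
from memory, so double check them on hammer):

# ceph osd find 193
# ceph osd tree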
> The crush weights are really off right now but even with a default crush
> map I am seeing a similar spread::
>
> # osdmaptool --test-map-pgs --pool 1 /tmp/osdmap
> avg 82 stddev 10.54 (0.128537x) (expected 9.05095 0.110377x))
> min osd.336 55
> max osd.54 115
>
> That's with a default weight of 3.000 across all osds. I was wondering if
> anyone can give me any tips on how to reach closer to 80% full.
>
> We have 630 osds (down one host right now, but it will be back in the
> cluster in a week or so) spread across 3 racks of 7 hosts (30 osds each).
> Our data replication scheme is by rack, and we only use S3 (so 98% of our
> data is in the .rgw.buckets pool). We are on hammer (0.94.7) and using the
> hammer tunables.
>
What comes to mind here is that your split into 3 buckets (racks)
and then into 7 (hosts) is probably not helping the already rather fuzzy
CRUSH algorithm come up with an even distribution, meaning that
imbalances are likely to be amplified.
And dense storage servers (30 OSDs each) of course amplify things further
when one goes down.
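As a rough back-of-the-envelope illustration (assuming size 3 with rack
as the failure domain, so each rack holds a full copy of the data): one
rack is 7 x 30 = 210 OSDs, and losing a 30-OSD host means the remaining
180 OSDs in that rack have to absorb its share, i.e. roughly 30/180 =
~17% more data per surviving OSD in that rack, on top of whatever
imbalance CRUSH already produced.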
So how many PGs are in the .rgw.buckets pool, then?
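If you're not sure off-hand, something like this should tell you (pool
name taken from your mail, command syntax from memory):

# ceph osd pool get .rgw.buckets pg_num
# ceph osd pool get .rgw.buckets size

As a sanity check against the usual ~100 PGs per OSD rule of thumb:
630 OSDs x 100 / 3 replicas = 21000, so the "ideal" pg_num would be a
nearby power of two (16384 or 32768, depending on how much growth
headroom you want).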
With jewel (backport exists, check the ML archives) there's an improved
reweight-by-utilization script that can help with these things.
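If you do go that route, the reworked version also has a dry-run mode,
so you can preview the changes before applying them, roughly along
these lines (the optional arguments differ a bit between versions, so
check on your build):

# ceph osd test-reweight-by-utilization 110
# ceph osd reweight-by-utilization 110

where 110 is an example threshold, meaning only OSDs more than 10%
above the average utilization get their reweight adjusted.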
Personally, I prefer to do this manually, using the (persistent) crush
reweight to achieve a more even distribution.
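For example (the OSD ids and values here are purely illustrative,
picked from the extremes you posted; take the real targets from your
own ceph osd df output):

# ceph osd crush reweight osd.193 2.2
# ceph osd crush reweight osd.559 4.8

The CRUSH weight is stored in the crush map itself and thus sticks
around, whereas the 0-1 reweight override can get reset when an OSD
goes out and comes back in.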
For example, on one cluster here I got the 18 HDD OSDs all within 100GB of
each other.
However, having lost 3 of those OSDs 2 days ago, the spread is now 300GB,
most likely NOT helped by the manual adjustments done earlier.
So a cluster that is nicely and evenly distributed with custom weights
during normal operation may end up worse off when there is a significant
OSD loss.
Christian
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com