On Mon, 28 Jul 2014 14:24:02 +0000 Edward Huyer wrote:

> > > Ceph has a default pool size of 3. Is it a bad idea to run a pool
> > > of size 2? What about size 2 min_size 1?
> > >
> > min_size 1 is sensible, 2 obviously won't protect you against dual
> > disk failures. Which happen, and happen with near certainty once
> > your cluster gets big enough.
>
> I thought I saw somewhere in the docs that there could be issues with
> min_size 1, but I can't seem to find it now.
>
> > > I have a cluster I'm moving data into (on RBDs) that is full
> > > enough with size 3 that I'm bumping into nearfull warnings. Part
> > > of that is because of the amount of data, part is probably because
> > > of suboptimal tuning (Proxmox VE doesn't support all the tuning
> > > options), and part is probably because of unbalanced drive
> > > distribution and multiple drive sizes.
> > >
> > > I'm hoping I'll be able to solve the drive size/distribution
> > > issue, but in the meantime, what problems could the size and
> > > min_size changes create (aside from the obvious issue of fewer
> > > replicas)?
> >
> > I'd address all those issues (setting the correct weight for your
> > OSDs), because it is something you will need to do anyway down the
> > road. Alternatively, add more nodes and OSDs.
>
> I don't think it's a weighting issue. My weights seem sane (e.g., they
> are scaled according to drive size). I think it's more an artifact
> arising from a combination of factors:
> - A relatively small number of nodes
> - Some of the nodes having additional OSDs
> - Those additional OSDs being 500GB drives compared to the other OSDs
>   being 1TB and 3TB drives
> - Having to use older CRUSH tunables
> - The cluster being around 72% full with that pool set to size 3
>
> Running 'ceph osd reweight-by-utilization' clears the issue up
> temporarily, but additional data inevitably causes certain OSDs to be
> overloaded again.
>
The only time I've ever seen this kind of uneven distribution is when
using too few PGs (and the pg_num suggested by the default formula can
still be too low when you have only a few OSDs). Did you look into
that?

> > While setting the replica count down to 2 will "solve" your problem,
> > it will also create another one besides the reduced redundancy:
> > it will reshuffle all your data, slowing down your cluster (to the
> > point of becoming unresponsive if it isn't designed and configured
> > well).
> >
> > Murphy might take those massive disk reads and writes as a cue to
> > provide you with a double disk failure as well. ^o^
>
> I actually already did the size 2 change on that pool before I sent my
> original email. It was the only way I could get the data moved. It
> didn't result in any data movement, just deletion. When I get new
> drives I'll turn that knob back up.
>
Ahahaha, there you go. I actually changed my test cluster from 2 to 3
and was going to change it back when the data dance stopped, but you
beat me to it.
This is quite (pleasantly) surprising, as fiddling with any CRUSH knob
usually sends Ceph into data-shuffling overdrive.

> Thanks for your input, by the way.
>
You're quite welcome; glad to hear it worked out that way.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/
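
For reference, the knobs discussed in this thread map onto the ceph CLI
roughly as follows. This is a minimal sketch; the pool name "rbd" is an
assumption, substitute your own:

    # Check the current replication settings of the pool ("rbd" is a
    # placeholder pool name).
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # Drop to 2 replicas while still serving I/O with a single copy.
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 1

    # One-shot rebalance of overloaded OSDs; the optional threshold is
    # a percentage of average utilization (120 is the default).
    ceph osd reweight-by-utilization 120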
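
As for the PG count Christian raises: the rule of thumb in the Ceph
docs is (number of OSDs * 100) / replica count, rounded up to the next
power of two; e.g. 12 OSDs at size 3 gives 400, rounded up to 512. A
sketch of checking and raising it (pg_num can only be increased, and
pgp_num should be raised to match; again, "rbd" is a placeholder):

    # Inspect the current PG count of the pool.
    ceph osd pool get rbd pg_num

    # Raise pg_num first, then pgp_num, to the computed power of two.
    ceph osd pool set rbd pg_num 512
    ceph osd pool set rbd pgp_num 512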