> > Ceph has a default pool size of 3. Is it a bad idea to run a pool of
> > size 2? What about size 2 min_size 1?
> >
> min_size 1 is sensible, 2 obviously won't protect you against dual disk failures.
> Which happen and happen with near certainty once your cluster gets big
> enough.

I thought I saw somewhere in the docs that there could be issues with
min_size 1, but I can't seem to find it now.

> > I have a cluster I'm moving data into (on RBDs) that is full enough
> > with size 3 that I'm bumping into nearfull warnings. Part of that is
> > because of the amount of data, part is probably because of suboptimal
> > tuning (Proxmox VE doesn't support all the tuning options), and part
> > is probably because of unbalanced drive distribution and multiple
> > drive sizes.
> >
> > I'm hoping I'll be able to solve the drive size/distribution issue,
> > but in the meantime, what problems could the size and min_size
> > changes create (aside from the obvious issue of fewer replicas)?
>
> I'd address all those issues (setting the correct weight for your OSDs).
> Because it is something you will need to do anyway down the road.
> Alternatively add more nodes and OSDs.

I don't think it's a weighting issue. My weights seem sane (e.g., they
are scaled according to drive size). I think it's more an artifact
arising from a combination of factors:

- A relatively small number of nodes
- Some of the nodes having additional OSDs
- Those additional OSDs being 500GB drives, compared to the other OSDs
  being 1TB and 3TB drives
- Having to use older CRUSH tunables
- The cluster being around 72% full with that pool set to size 3

Running 'ceph osd reweight-by-utilization' clears the issue up
temporarily, but additional data inevitably causes certain OSDs to
become overloaded again.

> While setting the replica down to 2 will "solve" your problem, it will also
> create another one besides the reduced redundancy:
> It will reshuffle all your data, slowing down your cluster (to the point of
> becoming unresponsive if it isn't designed and configured well).
>
> Murphy might take those massive disk reads and writes as a clue to provide
> you with a double disk failure as well. ^o^

I actually already did the size 2 change on that pool before I sent my
original email. It was the only way I could get the data moved. It
didn't result in any data movement, just deletion. When I get new
drives I'll turn that knob back up.

Thanks for your input, by the way.
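
For reference, the size/min_size knobs being discussed are per-pool
settings. A minimal sketch, assuming a pool called 'rbd' (substitute
your own pool name):

    # check the current replication settings on the pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # drop to 2 copies; min_size 1 keeps I/O going with a single copy left
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 1

    # going back to 3 copies later (min_size 2 is the usual companion)
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2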
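
On the weighting point, this is roughly how I sanity-check the CRUSH
weights; osd.12 and the 0.5 value below are made-up examples, the
convention being weight roughly equal to drive size in TB:

    # show the CRUSH hierarchy with per-OSD weights
    ceph osd tree

    # permanently adjust a CRUSH weight if one looks off
    ceph osd crush reweight osd.12 0.5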
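
And for the fullness side, what I've been doing amounts to the
following (the 110 threshold argument is optional and only an example,
and 'ceph osd df' needs a reasonably recent release):

    # overall and per-pool usage
    ceph df

    # per-OSD fill levels, to spot the outliers
    ceph osd df

    # temporary override weights on the most-full OSDs
    ceph osd reweight-by-utilization 110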