On Mon, Jul 28, 2014 at 12:14 PM, Christian Balzer <chibi at gol.com> wrote:
> On Mon, 28 Jul 2014 14:24:02 +0000 Edward Huyer wrote:
>
>> > > Ceph has a default pool size of 3. Is it a bad idea to run a pool
>> > > of size 2? What about size 2 min_size 1?
>> > >
>> > min_size 1 is sensible; size 2 obviously won't protect you against
>> > dual disk failures, which happen, and happen with near certainty,
>> > once your cluster gets big enough.
>>
>> I thought I saw somewhere in the docs that there could be issues with
>> min_size 1, but I can't seem to find it now.
>>
>> > > I have a cluster I'm moving data into (on RBDs) that is full
>> > > enough with size 3 that I'm bumping into nearfull warnings. Part
>> > > of that is because of the amount of data, part is probably because
>> > > of suboptimal tuning (Proxmox VE doesn't support all the tuning
>> > > options), and part is probably because of unbalanced drive
>> > > distribution and multiple drive sizes.
>> > >
>> > > I'm hoping I'll be able to solve the drive size/distribution
>> > > issue, but in the meantime, what problems could the size and
>> > > min_size changes create (aside from the obvious issue of fewer
>> > > replicas)?
>> >
>> > I'd address all those issues (setting the correct weight for your
>> > OSDs), because it is something you will need to do anyway down the
>> > road. Alternatively, add more nodes and OSDs.
>>
>> I don't think it's a weighting issue. My weights seem sane (e.g., they
>> are scaled according to drive size). I think it's more an artifact
>> arising from a combination of factors:
>> - A relatively small number of nodes
>> - Some of the nodes having additional OSDs
>> - Those additional OSDs being 500GB drives compared to the other OSDs
>>   being 1TB and 3TB drives
>> - Having to use older CRUSH tunables
>> - The cluster being around 72% full with that pool set to size 3
>>
>> Running 'ceph osd reweight-by-utilization' clears the issue up
>> temporarily, but additional data inevitably causes certain OSDs to be
>> overloaded again.
>>
> The only time I've ever seen this kind of uneven distribution is when
> using too few PGs (and even the default formula can yield too few when
> you have a small number of OSDs).
>
> Did you look into that?
>
>> > While setting the replica count down to 2 will "solve" your problem,
>> > it will also create another one besides the reduced redundancy: it
>> > will reshuffle all your data, slowing down your cluster (to the
>> > point of becoming unresponsive if it isn't designed and configured
>> > well).
>> >
>> > Murphy might take those massive disk reads and writes as a clue to
>> > provide you with a double disk failure as well. ^o^
>>
>> I actually already did the size 2 change on that pool before I sent my
>> original email. It was the only way I could get the data moved. It
>> didn't result in any data movement, just deletion. When I get new
>> drives I'll turn that knob back up.
>>
> Ahahaha, there you go.
> I actually changed my test cluster from 2 to 3 and was going to change
> it back when the data dance stopped, but you beat me to it.
>
> This is quite (pleasantly) surprising, as fiddling with any CRUSH knob
> usually makes Ceph go into data shuffling overdrive.

Yep, this is deliberate: the sizing knobs aren't used as CRUSH inputs;
they just control how much of the CRUSH output is used. Scaling that
value up or down adds or removes entries at the end of the set of OSDs
hosting a PG, but doesn't change the order in which they appear.
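For concreteness, a minimal sketch of the commands discussed above; the
pool name 'rbd' and the threshold value are assumptions, so substitute
your own:

    # Drop the replica count. Each PG's acting set keeps its ordering,
    # so this only deletes the surplus replica; no data is reshuffled
    # (pool name 'rbd' is an assumption):
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 1

    # Rebalance overfull OSDs; the optional threshold (here 120% of
    # average utilization) limits which OSDs get their reweight
    # adjusted:
    ceph osd reweight-by-utilization 120

    # Check the PG count against the usual rule of thumb of roughly
    # (100 * number of OSDs) / replica count, rounded up to a power
    # of two:
    ceph osd pool get rbd pg_num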
Things that do shuffle data (example commands below):
1) changing weights (obviously)
2) changing internal CRUSH parameters (for most users, this means
changing the tunables)
3) changing how the map looks (i.e., adding OSDs)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
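For illustration, one command per item in the list above; the OSD ids,
the weight, the tunables profile, and the host bucket name are all
made-up values:

    # 1) Changing a CRUSH weight (here osd.3 to 2.0, roughly a 2TB
    # drive; id and weight are illustrative):
    ceph osd crush reweight osd.3 2.0

    # 2) Changing internal CRUSH parameters by switching the tunables
    # profile, which can remap a large fraction of PGs:
    ceph osd crush tunables optimal

    # 3) Changing how the map looks, e.g. adding a new OSD under an
    # existing host bucket (osd.12, weight 1.0, and node4 are
    # illustrative):
    ceph osd crush add osd.12 1.0 host=node4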