On Mon, Jul 28, 2014 at 12:14 PM, Christian Balzer <chibi at gol.com> wrote:
> On Mon, 28 Jul 2014 14:24:02 +0000 Edward Huyer wrote:
>
>> > > Ceph has a default pool size of 3. Is it a bad idea to run a pool
>> > > of size 2? What about size 2 min_size 1?
>> > >
>> > min_size 1 is sensible; size 2 obviously won't protect you against
>> > dual disk failures, which happen, and happen with near certainty,
>> > once your cluster gets big enough.
>>
>> I thought I saw somewhere in the docs that there could be issues with
>> min_size 1, but I can't seem to find it now.
>>
>> > > I have a cluster I'm moving data into (on RBDs) that is full
>> > > enough with size 3 that I'm bumping into nearfull warnings. Part
>> > > of that is because of the amount of data, part is probably because
>> > > of suboptimal tuning (Proxmox VE doesn't support all the tuning
>> > > options), and part is probably because of unbalanced drive
>> > > distribution and multiple drive sizes.
>> > >
>> > > I'm hoping I'll be able to solve the drive size/distribution
>> > > issue, but in the meantime, what problems could the size and
>> > > min_size changes create (aside from the obvious issue of fewer
>> > > replicas)?
>> >
>> > I'd address all those issues (setting the correct weight for your
>> > OSDs), because it is something you will need to do anyway down the
>> > road. Alternatively, add more nodes and OSDs.
>>
>> I don't think it's a weighting issue. My weights seem sane (e.g., they
>> are scaled according to drive size). I think it's more an artifact
>> arising from a combination of factors:
>> - A relatively small number of nodes
>> - Some of the nodes having additional OSDs
>> - Those additional OSDs being 500GB drives compared to the other OSDs
>>   being 1TB and 3TB drives
>> - Having to use older CRUSH tunables
>> - The cluster being around 72% full with that pool set to size 3
>>
>> Running 'ceph osd reweight-by-utilization' clears the issue up
>> temporarily, but additional data inevitably causes certain OSDs to be
>> overloaded again.
>>
> The only time I've ever seen this kind of uneven distribution is when
> using too few PGs (and even the default formula can yield too few when
> you have a small number of OSDs).
>
> Did you look into that?
>
>> > While setting the replica count down to 2 will "solve" your problem,
>> > it will also create another one besides the reduced redundancy: it
>> > will reshuffle all your data, slowing down your cluster (to the
>> > point of becoming unresponsive if it isn't designed and configured
>> > well).
>> >
>> > Murphy might take those massive disk reads and writes as a clue to
>> > provide you with a double disk failure as well. ^o^
>>
>> I actually already did the size 2 change on that pool before I sent my
>> original email. It was the only way I could get the data moved. It
>> didn't result in any data movement, just deletion. When I get new
>> drives I'll turn that knob back up.
>>
> Ahahaha, there you go.
> I actually changed my test cluster from 2 to 3 and was going to change
> it back when the data dance stopped, but you beat me to it.
>
> This is quite (pleasantly) surprising, as fiddling with any CRUSH knob
> usually makes Ceph go into data shuffling overdrive.

Yep, this is deliberate: the sizing knobs aren't used as CRUSH inputs;
they just control how much of the CRUSH output is used. Scaling that
value up or down adds or removes entries at the end of the set of OSDs
hosting a PG, but doesn't change the order in which they appear.
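For concreteness, a minimal sketch of the commands discussed above; the
pool name 'rbd' and the threshold value are assumptions, so substitute
your own:

    # Drop the replica count. Each PG's acting set keeps its ordering,
    # so this only deletes the surplus replica; no data is reshuffled
    # (pool name 'rbd' is an assumption):
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 1

    # Rebalance overfull OSDs; the optional threshold (here 120% of
    # average utilization) limits which OSDs get their reweight
    # adjusted:
    ceph osd reweight-by-utilization 120

    # Check the PG count against the usual rule of thumb of roughly
    # (100 * number of OSDs) / replica count, rounded up to a power
    # of two:
    ceph osd pool get rbd pg_num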
Things that do shuffle data (example commands below):
1) changing weights (obviously)
2) changing internal CRUSH parameters (for most users, this means
changing the tunables)
3) changing how the map looks (i.e., adding OSDs)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
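For illustration, one command per item in the list above; the OSD ids,
the weight, the tunables profile, and the host bucket name are all
made-up values:

    # 1) Changing a CRUSH weight (here osd.3 to 2.0, roughly a 2TB
    # drive; id and weight are illustrative):
    ceph osd crush reweight osd.3 2.0

    # 2) Changing internal CRUSH parameters by switching the tunables
    # profile, which can remap a large fraction of PGs:
    ceph osd crush tunables optimal

    # 3) Changing how the map looks, e.g. adding a new OSD under an
    # existing host bucket (osd.12, weight 1.0, and node4 are
    # illustrative):
    ceph osd crush add osd.12 1.0 host=node4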