Re: Questions about CRUSH

Hi Sage,

On Thu, 2010-10-07 at 13:50 -0700, Sage Weil wrote:
> Hi Wido!
> 
> On Thu, 7 Oct 2010, Wido den Hollander wrote:
> > Hi,
> > 
> > I'm working on a crushmap where I have my hosts spread out over 3 racks
> > (leaves).
> > 
> > I have 9 physical machines, each with one OSD, spread out over three
> > racks.
> > 
> > The replication level I intend to use is 3, my goal with this crushmap
> > is to prevent two replicas being stored in the same rack.
> > 
> > Now, this map seems fine to me, but what if one of the racks fails and
> > the cluster starts to fix itself, then I would get two replicas in the
> > same rack, wouldn't I?
> 
> Right.
> 
> > Is it better to have: leaves at root = (max replication level + 1)?
> > 
> > So, if I have my replication level set to 3, I should have 4 racks with
> > 3 OSDs each; then the cluster could recover from a complete rack
> > failure without compromising my data safety.
> > 
> > When a complete leaf (rack) fails, the other leaves should be able to
> > store all the data, so if my replication level is set to 3, I should
> > always have at least 1/3 of free space, otherwise a full recovery won't
> > be possible, correct? (The OSDs would run out of disk space.)
> > 
> > Am I missing something here or is this the right approach?
> 
> Yeah, I think this is the right approach. 
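
To put rough numbers on the free-space question, here is a small
back-of-the-envelope sketch in Python. It is not Ceph code. The function
names are made up for illustration, and it assumes all racks have equal
capacity and that data is spread evenly over them, which a real cluster
only approximates.

```python
from fractions import Fraction


def min_free_fraction(racks: int, failed: int = 1) -> Fraction:
    """Fraction of total raw capacity that must stay free so that the
    surviving racks can absorb the data held by the failed racks.

    Assumption: equal-capacity racks, evenly distributed data.
    Used data must fit in (racks - failed) / racks of the capacity,
    so at least failed / racks of it has to be free.
    """
    if not 0 <= failed < racks:
        raise ValueError("failed must be in the range 0 .. racks-1")
    return Fraction(failed, racks)


def keeps_rack_separation(racks: int, replicas: int, failed: int = 1) -> bool:
    """True if enough racks survive to place every replica in its own rack."""
    return racks - failed >= replicas


if __name__ == "__main__":
    for racks in (3, 4):
        print(racks, "racks:",
              "free >=", min_free_fraction(racks),
              "| separation kept:", keeps_rack_separation(racks, replicas=3))
```

Under those assumptions, three racks need 1/3 of the raw capacity free
but cannot keep three replicas in separate racks after a rack failure,
while four racks need 1/4 free and can. Real OSDs fill unevenly, so in
practice more headroom than this is probably wise.
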
> 
> > And I'm not completely sure about:
> > 
> > rule placein3racks {
> rule placeinNracks {
> >         ruleset 0
> >         type replicated
> >         min_size 2
> >         max_size 2
> 	min_size 2
> 	max_size 10
> >         step take root
> >         step chooseleaf firstn 0 type rack
> >         step emit
> > }
> >
> > Is that correct? Here I say that the first step should be to choose a
> > rack where the replica should be saved. Should I also specify to choose
> > a host afterwards?
> 
> The rule generalizes to N replicas, where N can be 2..10 (that's what the 
> min/max size fields are for).  And the chooseleaf line is correct.  That 
> chooses N leaves/devices that are nested beneath N distinct racks.  Which 
> is what you want!
> 
> You could also do
> 
> 	step take root
> 	step choose firstn 0 type rack
> 	step choose firstn 1 type device
> 	step emit

Shouldn't that be:

	step take root
	step choose firstn 0 type rack
	step choose firstn 1 type host
	step choose firstn 2 type device
	step emit

Or am I wrong here?

> 
> That would choose N racks, and then for each rack, choose a nested device.  
> The problem is when one of the racks it chooses has no (or few) online 
> devices beneath it, we fail to find a usable device, and the result set 
> will have <N devices.  Chooseleaf doesn't have that problem.

So the chooseleaf rule (step chooseleaf firstn 0 type rack) should be
safe enough in this case?
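
To check that I understand the difference, here is a toy model in
Python. It is only a sketch of the behaviour you describe, not the real
CRUSH algorithm: there is no hashing, weighting or retry logic, racks
are simply tried in dictionary order, and the cluster layout and
function names are invented for the example.

```python
# Toy cluster: rack name -> list of (osd name, is_up). Rack "b" is fully down.
CLUSTER = {
    "a": [("osd.0", True), ("osd.1", True), ("osd.2", True)],
    "b": [("osd.3", False), ("osd.4", False), ("osd.5", False)],
    "c": [("osd.6", True), ("osd.7", True), ("osd.8", True)],
    "d": [("osd.9", True), ("osd.10", True), ("osd.11", True)],
}


def first_up(osds):
    """Return the first OSD that is up, or None if the whole rack is down."""
    for name, is_up in osds:
        if is_up:
            return name
    return None


def choose_rack_then_device(cluster, n):
    """Model of 'choose firstn N type rack' followed by a separate
    'choose firstn 1 type device': the racks are fixed first, so a rack
    without any usable device just leaves a hole in the result."""
    racks = list(cluster)[:n]
    picked = [first_up(cluster[rack]) for rack in racks]
    return [osd for osd in picked if osd is not None]


def chooseleaf_rack(cluster, n):
    """Model of 'chooseleaf firstn N type rack': a rack only counts if a
    usable device is found beneath it; otherwise another rack is tried."""
    result = []
    for rack, osds in cluster.items():
        if len(result) == n:
            break
        osd = first_up(osds)
        if osd is not None:
            result.append(osd)
    return result


if __name__ == "__main__":
    print(choose_rack_then_device(CLUSTER, 3))  # ['osd.0', 'osd.6']
    print(chooseleaf_rack(CLUSTER, 3))          # ['osd.0', 'osd.6', 'osd.9']
```

In this model the two-step rule returns only two devices because it
committed to rack "b" before looking beneath it, while the chooseleaf
variant skips "b" and still returns three devices in three distinct
racks.
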

> 
> sage

Wido
