Re: [Discussion] Enhancement for CRUSH rules

On Thursday, November 22, 2012 at 7:38 PM, Chen, Xiaoxi wrote:
> Hi list,
> I am thinking about the possibility of adding some primitives to CRUSH to meet the following user stories:
> A. "Same host", "Same rack"
> To balance availability against performance, one may want a rule like: 3 replicas, where Replica 1 and Replica 2 are in the same rack while Replica 3 resides in another rack. This is common because a typical datacenter deployment usually has much less uplink bandwidth than backbone bandwidth.
>  
> More aggressive users may even want the same host, since the most common failure is a disk failure, and several disks (which means several OSDs) reside in the same physical machine. If we could place Replicas 1 & 2 on the same host and Replica 3 somewhere else, it would not only reduce replication traffic but also save a lot of time and bandwidth when a disk fails and recovery takes place.
This is a feature we're definitely interested in! The difficulty with this (as I understand it) is that right now the CRUSH code is very parallel and even-handed — each instruction in a CRUSH rule is executed in sequence on every bucket it has in its set. Somebody would need to change it so that you could say something like:
step take root
step choose firstn -1 rack
step rack0 choose 2 device
step rackn choose 1 device
emit
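
For contrast, a stock replicated rule today (a minimal sketch in the existing decompiled crushtool syntax; the rule name here is made up) applies each step uniformly to the whole working set, so there is no way to ask for two devices from the first rack and one from the next:

rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}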
  
> B."local"
> Although we cannot mount RBD volumes to where a OSD running at, but QEMU canbe used. This scenarios is really common in cloud computing. We have a large amount of compute-nodes, just plug in some disks and make the machines reused for Ceph cluster. To reduce network traffic and latency , if it is possible to have some placement-group-maybe 3 PG for a compute-node. Define the rules like: primary copy of the PG should (if possible) reside in localhost, the second replica should go different places
>  
> By doing this, a significant amount of network bandwidth and an RTT can be saved. What's more, since reads always go to the primary, reads would benefit a lot from such a mechanism.
>  
> It looks to me that A is simpler, while B seems much more complex. Hoping for input.
This has existed previously, in the form of local PGs and CRUSH force-feeding. We ripped it out in the name of simplicity and due to never really finding a justifiable use-case — the one we had was Hadoop, and the rumors we heard out of the big shops were that for that workload, local writes weren't actually a win…
The other issue with it is that the data use over the cluster's disks can become pretty badly unbalanced since placement is no longer pseudo-random, and Ceph still needs a lot of work on full-disk management before that's something we want to allow.
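As an aside on the QEMU path mentioned in the original mail: attaching an RBD image to a guest through librbd (rather than a kernel-client mount on an OSD host) looks roughly like the line below. This is a minimal sketch; the "rbd" pool and "vm-disk" image names are placeholders:

qemu -m 1024 -drive format=raw,file=rbd:rbd/vm-disk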
-Greg


