Re: Question regarding CRUSH algorithm

Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> · Fri, 17 Feb 2017 11:50:23 +0000

On 16/02/17 20:44, girish kenkere wrote:
> Thanks David,
> 
> Its not quiet what i was looking for. Let me explain my question in more detail -
> 
> This is excerpt from Crush paper, this explains how crush algo running on each client/osd maps pg to an osd during the write operation[lets assume].
> 
> /"Tree buckets are structured as a weighted binary search tree with items at the leaves. Each interior node knows the total weight of its left and right subtrees and is labeled according to a fixed strategy (described below). In order to select an item within a bucket, CRUSH starts at the root of the tree and calculates the hash of the input key x, replica number r, the bucket identifier, and the label at the current tree node (initially the root). The result is compared to the weight ratio of the left and right subtrees to decide which child node to visit next. This process is repeated until a leaf node is reached, at which point the associated item in the bucket is chosen. Only logn hashes and node comparisons are needed to locate an item.:"/
> 
>  My question is along the way the tree structure changes, weights of the nodes change and some nodes even go away. In that case, how are future reads lead to pg to same osd mapping? Its not cached anywhere, same algo runs for every future read - what i am missing is how it picks the same osd[where data resides] every time. With a modified crush map, won't we end up with different leaf node if we apply same algo? 
> 
> Thanks
> Girish
> 
> On Thu, Feb 16, 2017 at 12:05 PM, David Turner <david.turner@xxxxxxxxxxxxxxxx <mailto:david.turner@xxxxxxxxxxxxxxxx>> wrote:
> 
>     As a piece to the puzzle, the client always has an up to date osd map (which includes the crush map).  If it's out of date, then it has to get a new one before it can request to read or write to the cluster.  That way the client will never have old information and if you add or remove storage, the client will always have the most up to date map to know where the current copies of the files are.
> 
>     This can cause slow downs in your cluster performance if you are updating your osdmap frequently, which can be caused by deleting a lot of snapshots as an example.

>     ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>     *From:* ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx <mailto:ceph-users-bounces@xxxxxxxxxxxxxx>] on behalf of girish kenkere [kngenius@xxxxxxxxx <mailto:kngenius@xxxxxxxxx>]
>     *Sent:* Thursday, February 16, 2017 12:43 PM
>     *To:* ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     *Subject:*  Question regarding CRUSH algorithm
> 
>     Hi, I have a question regarding CRUSH algorithm - please let me know how this works. CRUSH paper talks about how given an object we select OSD via two mapping - first one is obj to PG and then PG to OSD. 
> 
>     This PG to OSD mapping is something i dont understand. It uses pg#, cluster map, and placement rules. How is it guaranteed to return correct OSD for future reads after the cluster map/placement rules has changed due to nodes coming and out?
> 
>     Thanks
>     Girish

I think there is confusion over when the CRUSH algorithm is being run. It's my understanding that the object->PG mapping is always dynamically computed, and that's pretty simple (hash the object ID, take it modulo [num_pgs in pool], prepend pool ID, 8.0b's your uncle), but the PG->OSD mapping is only computed when new PGs are created or the CRUSH map changes. The result of that computation is stored in the cluster map and then locating a particular PG is a matter of looking it up in the map, not recalculating its location - PG placement is pseudorandom and nondeterministic anyway, so that would never work.

So - the client DOES run CRUSH to find the location of an object, but only in the sense of working out which PG it's in. It then looks up the PG in the cluster map (which includes the osdmap that David mentioned).

See http://docs.ceph.com/docs/master/architecture/#mapping-pgs-to-osds and http://docs.ceph.com/docs/master/architecture/#cluster-map and note that the cluster map contains the mapping between OSDs and PGs.

Rich

Attachment:
signature.asc

Description: OpenPGP digital signature
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com