Re: crush multiweight implementation details

On Tue, 11 Apr 2017, Loic Dachary wrote:
> On 04/10/2017 04:53 PM, Loic Dachary wrote:
> > On 04/10/2017 04:44 PM, Sage Weil wrote:
> >> On Mon, 10 Apr 2017, Loic Dachary wrote:
> >>> On 04/10/2017 04:11 PM, Sage Weil wrote:
> >>>> On Mon, 10 Apr 2017, Loic Dachary wrote:
> >>>>> Hi Sage,
> >>>>>
> >>>>> We could have:
> >>>>>
> >>>>> struct crush_choose_arg {
> >>>>>   __u32 bucket_id;
> >>>>>   __u32 num_items;
> >>>>>   __u32 *ids; // override the bucket items for placement
> >>>>>   __u32 num_positions;
> >>>>>   __u32 *weights; // size is num_positions*num_items
> >>>>> };
> >>>>>  
> >>>>> struct crush_choose_arg_map {
> >>>>>   struct crush_choose_arg *args;
> >>>>>   __u32 size;
> >>>>> };
> >>>>>
> >>>>> and
> >>>>>
> >>>>> void crush_init_workspace(const struct crush_map *m, struct crush_choose_arg_map *arg_map, void *v) {
> >>>>> ...
> >>>>> if (m->buckets[b]->id == arg_map->args[b].bucket_id)
> >>>>>    w->work[b]->arg = &arg_map->args[b];
> >>>>> ...
> >>>>> }
> >>>>>
> >>>>> with
> >>>>>
> >>>>> struct crush_work_bucket {
> >>>>>     __u32 perm_x; /* @x for which *perm is defined */
> >>>>>     __u32 perm_n; /* num elements of *perm that are permuted/defined */
> >>>>>     __u32 *perm;  /* Permutation of the bucket's items */
> >>>>>     struct crush_choose_arg *arg;
> >>>>> };
> >>>>>
> >>>>> There would be no need to change the code path since crush_bucket_choose 
> >>>>> is already given the crush_work_bucket. And crush_init_workspace already 
> >>>>> has logic that is algorithm dependent. All the sanity checks could 
> >>>>> be done in crush_init_workspace, so that the choose function only does 
> >>>>> what's absolutely necessary.
> >>>>
> >>>> Allowing overrides of the bucket items too makes me nervous (do we have a 
> >>>> use for that yet?), 
> >>>
> >>> Maybe I misunderstood what you were after with bucket_id in http://pad.ceph.com/p/crush-multiweight around here?
> >>>
> >>> struct crush_bucket_weight_set {
> >>>   __u32 bucket_id;  /* used as input to hash in place of bucket id */
> >>>   __u32 num_positions, num_items;
> >>>   __u32 *data; // index like [item*num_items+pos]
> >>> };
> >>
> >> It's just the bucket id that matters; the members of the bucket don't need 
> >> to change.  I'm not sure that the __u32 *ids above is needed for 
> >> anything... unless I'm misunderstanding something?
> > 
> > The code would look like this:
> > 
> > static int bucket_straw2_choose(const struct crush_bucket_straw2 *bucket,
> >                                 int x, int r,
> >                                 const struct crush_choose_arg_map *arg_map,
> >                                 int position)
> > {
> >         struct crush_choose_arg_at_position *arg = get_straw2_choose_arg(bucket, arg_map, position);
> >         unsigned int i, high = 0;
> >         unsigned int u;
> >         unsigned int w;
> >         unsigned int id;
> >         __s64 ln, draw, high_draw = 0;
> > 
> >         for (i = 0; i < bucket->h.size; i++) {
> >                 w = arg->weights[i];
> >                 id = arg->ids[i];
> >                 dprintk("weight 0x%x item %d\n", w, id);
> >                 if (w) {
> >                         u = crush_hash32_3(bucket->h.hash, x, id, r);
> >                         /* ... rest of the straw2 draw computation ... */
> > I don't see how changing the bucket id alone would work, since the 
> > bucket id itself is not an input to the per-item hash.
> 
> Another reason for having __u32 *ids to remap each value in 
> bucket->items instead of having a single __u32 bucket_id is that it 
> allows us to remap both bucket ids and device ids.
> 
> Maybe we should just leave this id remapping for later, since we don't 
> currently have a use case for it? Implementing the weight set as part of 
> the workspace is straightforward, and we can follow the same code path 
> at a later time if/when substituting bucket/device ids becomes useful.

Oh, I misunderstood... I thought these were changing the actual bucket 
items, not just substituting ids for input in the placement hash.

There are several ways that I can imagine using it, but all of them are 
now covered by the fact that we can feed in alternative weight values.
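As a rough illustration of the weight-set lookup being discussed, here is a minimal self-contained sketch (the names and the position-clamping convention are hypothetical, not the actual patch): the per-position weights are stored flat, `num_positions * num_items` entries, and the weight for item i at replica position p is read out of that array, falling back to the last defined position when fewer positions than replicas are supplied.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical, simplified version of the struct discussed above. */
struct crush_choose_arg {
	int32_t bucket_id;
	uint32_t num_items;
	uint32_t num_positions;
	uint32_t *weights;	/* num_positions * num_items entries */
};

/* Weight to use for item i at replica position `position`.  Positions
 * beyond the last defined weight set clamp to the last one; a missing
 * arg returns 0 so the caller can fall back to the bucket weights. */
static uint32_t choose_arg_weight(const struct crush_choose_arg *arg,
				  uint32_t position, uint32_t i)
{
	if (arg == NULL || arg->weights == NULL)
		return 0;
	if (position >= arg->num_positions)
		position = arg->num_positions - 1;
	return arg->weights[position * arg->num_items + i];
}
```

The flat layout keeps the per-choose lookup to a single multiply and index, which matches the goal of doing all validation in crush_init_workspace and only the bare minimum in the choose function.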

The one exception is (oddly enough) the device classes.  We have a shadow 
hierarchy with different bucket ids.  That can also be accomplished by 
having an alternate weight set for each device class.  However, if a user 
starts using device classes now, without requiring luminous clients, they 
are signing up for the alternate bucket ids; if we later want to 
transition that feature to use the weight sets instead, we'll need the id 
remapping.
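For illustration, a minimal sketch of the id remapping discussed above (struct and function names hypothetical, not from the actual code): when an ids override is present, the value fed to the placement hash is taken from it instead of the bucket's own item id, which is what would let a shadow hierarchy hash with the original ids.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical ids override: NULL means no remapping.  Signed, since
 * bucket ids are negative while device ids are non-negative. */
struct choose_ids_override {
	uint32_t num_items;
	int32_t *ids;
};

/* Id to feed into the placement hash for item i: the override if one
 * is defined for that slot, otherwise the id stored in the bucket. */
static int32_t hash_input_id(const struct choose_ids_override *ov,
			     const int32_t *items, uint32_t i)
{
	if (ov != NULL && ov->ids != NULL && i < ov->num_items)
		return ov->ids[i];
	return items[i];
}
```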

So I'm inclined to include it (as you originally proposed)...

sage


