Re: CRUSH algorithm and its debug method

Ning Yao <zay11022@xxxxxxxxx> · Thu, 6 Oct 2016 11:52:29 +0800

Hi,

2016-10-03 21:22 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> Hi!
>
> On Sat, 1 Oct 2016, Ning Yao wrote:
>> Hi, Sage
>>
>> I find that several issues related to current CRUSH algorithm as below:
>>
>> 1. It is possible to select out the same collision and retry bucket in
>> a crush_choose_firstn() loop. (e.g.  when we set reweight to 0 or mark
>> osd out, it would be definitely rejected if it is selected. However,
>> when the second chance to select out another one based on the
>> different r', it is still possible to select out the same osd
>> previously rejected, right? And until a different one is selected
>> after several retries.).  I think we can record those rejected or
>> collision osds in the same loop so that the process can be converged
>> much faster?
>
> It's sitll possible to pick the same thing with a different r'.
> That's why the max retries values has to be reasonably high.  I'm not sure
> how you would avoid doing the full calculation, though... can you be
> more specific?
I have two ideas to do that:

1) the simplest idea is allocate a bool array in crush_choose_firstn()
whose size is crush_bucket->size. And then pass the array pointer into
crush_bucket_choose(). What we shoud do is:
    i) mark those already selected, collision or rejected items as true
    ii) when determine (hight and high_draw in bucket_straw_choose or
bucket_straw2_choose), we can skip those items already marked TRUE.
    But the drawback is that  we always need to allocate an dynamic
array within each iteration, And memory allocation in heap is an
expensive operation.

2) So the second idea is to use a uint64_t s_mask (said selected_mask
to indicate those items already been selected). What we should do is:
    i) init uint64_t s_mask = 0; in crush_choose_firstn()
    ii) pass value by reference into  bucket_straw_choose or
bucket_straw2_choose and mark the corresponding bit as one if item is
selected, collision or rejected.  s_mask  = s_mask | (1ul << high)
    iii) do not consider those whose bit (s_mask & (1ul << high)) is
marked as one.
    But here is an explicit limitation, the number of items in the
bucket should be smaller than sizeof(int64_t). So we need to run the
previous strategy if crush_bucket->size > 64.

Furthermore, the strategy is just suitable for straw algorithm, so we
need to remain the strategy for tree, list and uniform algorithm.

Any other better ideas? @Wei-Chung Cheng

>> 2. Currently, the reweight params in crushmap is memoryless (e.g we
>> balance our data by reducing reweight, which will be lost after this
>> osd DOWN and OUT automatically. And we mark its IN again because
>> currently ceph osd in directly marks the reweight to 1.0 and out marks
>> the reweight to 0.0).  It is quite awkward when we use ceph osd
>> reweight-by-utilization to make data balance (If some osds down and
>> out, our previous effort is totally lost).   So I think marking osd
>> "in"  does not simply modify reweight to "1.0". Actually, we can
>> iteration the previous osdmap and find out the value of the reweight
>> or records it anywhere we can retrieve this value again?
>
> The old value is stored here
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSDMap.h#L89
>
> and restored when the OSD is marked back up, although IIRC there is a
> config option that controls when the old value is stored (it might only
> happen when the osd is marked out automatically, not when it is done
> manually?).  That behavior could be changed, though.
Thanks, fix in PR #11293 is what we expect.

>> 3. Currently, there is no debug option in the mapping progress in
>> Mapper.c. dprintk is default disabled so that it will be hard to dig
>> into the algorithm if something unexpected result happens. I think we
>> can introduce the debug options and output the debug information when
>> we use "ceph osd map xxx xxx" so that it is much more easier to find
>> the shortness in current mapping process?
>
> Yeah, it's hard to debug.  I usually uncomment the dprintk define and
> rebuild osdmaptool, which has a --test-map-pg option so that I can run
> a specific problematic mapping.  We could do something a bit more clever
> as long as it is a simple conditional--we don't want to slow down the
> mapping code as it is performance sensitive.

Yeah, I think it is enough to print out the debug information when we
run "ceph osd map data object_1". So we can pass ostream* into the
function, gathering the information. Normally, we can default pass the
ostream*  as a NULL pointer so that we can skip all of the dprintk?

> Thanks!
> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html