2016-10-03 21:22 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> Hi!
>
> On Sat, 1 Oct 2016, Ning Yao wrote:
>> Hi, Sage
>>
>> I find several issues with the current CRUSH algorithm, as below:
>>
>> 1. It is possible to select the same collided or rejected bucket
>> again within a crush_choose_firstn() loop. (e.g. when we set
>> reweight to 0 or mark an osd out, that osd will definitely be
>> rejected if it is selected. However, when we get a second chance to
>> select another one based on a different r', it is still possible to
>> select the same osd that was previously rejected, and this repeats
>> until a different one is finally selected after several retries.) I
>> think we could record the rejected or collided osds within the same
>> loop so that the process converges much faster?
>
> It's still possible to pick the same thing with a different r'.
> That's why the max retries value has to be reasonably high. I'm not
> sure how you would avoid doing the full calculation, though... can
> you be more specific?
>

Hi all,

I have the same question, and I have already written a simple version
that skips items that were already selected (for now ONLY for the
straw(2) algorithm). I would like to know why we don't skip the
selected item. I assumed that the current behavior is meant to avoid
choosing an unsuitable item in specific scenarios (such as two items
whose weights differ by a huge gap).

thanks!

>> 2. Currently, the reweight param in the crushmap is memoryless (e.g.
>> we balance our data by reducing reweight, which is lost after the
>> osd goes DOWN and is marked OUT automatically and we then mark it IN
>> again, because currently "ceph osd in" directly sets the reweight to
>> 1.0 and "out" sets the reweight to 0.0). This is quite awkward when
>> we use "ceph osd reweight-by-utilization" to balance data (if some
>> osds go down and out, our previous effort is totally lost). So I
>> think marking an osd "in" should not simply set reweight to "1.0".
>> Actually, we could iterate over the previous osdmaps and find the
>> value of the reweight, or record it somewhere we can retrieve it
>> from again?
>
> The old value is stored here
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSDMap.h#L89
>
> and restored when the OSD is marked back up, although IIRC there is a
> config option that controls when the old value is stored (it might
> only happen when the osd is marked out automatically, not when it is
> done manually?). That behavior could be changed, though.
>
>> 3. Currently, there is no debug option in the mapping process in
>> mapper.c. dprintk is disabled by default, so it is hard to dig into
>> the algorithm if an unexpected result happens. I think we could
>> introduce a debug option and output debug information when we use
>> "ceph osd map xxx xxx", so that it is much easier to find the
>> shortcomings of the current mapping process?
>
> Yeah, it's hard to debug. I usually uncomment the dprintk define and
> rebuild osdmaptool, which has a --test-map-pg option so that I can
> run a specific problematic mapping. We could do something a bit more
> clever as long as it is a simple conditional--we don't want to slow
> down the mapping code as it is performance sensitive.
>
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html