RE: straw2-like algorithm

Yes, the current implementation is very optimized and tricky. But it uses two lookup tables, so the question is what the system does between the choose function calls. If there are memory-intensive operations between them, the tables are likely to be displaced from the CPU caches. Another question is whether this has any visible impact on the overall performance (how often is it called?). I'm not aware of such details, so you are the ones who should make the decision.

Regards,
Artem

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] 
Sent: Saturday, May 16, 2015 1:28 AM
To: Artem Elkin
Cc: Ceph Development
Subject: Re: straw2-like algorithm

On Fri, May 15, 2015 at 2:33 PM, Artem Elkin <Artem.Elkin@xxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> I've checked the straw2 algorithm. It uses two 2 KB lookup tables to determine the log values. An exponential distribution can be obtained as a sum of squares of normally distributed numbers, and a sum of several uniformly distributed values is a good approximation to the normal distribution (see the short derivation after this quoted message). Even with a 4-byte sum the result is quite ok.
> I've tried this (the code is in C#, but I believe you'll catch the idea):
>
> public Item Select(Obj obj, out int iters) {
>     iters = 0;
>     while (true)
>     {
>         ++iters;
>         Item selectedItem = null;
>         long min = long.MaxValue;
>         for (int index = 0; index < Items.Count; index++)
>         {
>             var item = Items[index];
>             if (item.Weight == 0)
>                 continue;
>             var k = index + 1;
>             byte[] rnd1 = obj.GetPseudoRandom(iters * k);
>             byte[] rnd2 = obj.GetPseudoRandom(2 * iters + rnd1[0] * k);
>
>             int sum1 = 2; // average sbyte is -0.5; move the center to 0
>             for (int i = 0; i < 4; i++)
>                 sum1 += (sbyte)rnd1[i];
>
>             int sum2 = 2;
>             for (int i = 0; i < 4; i++)
>                 sum2 += (sbyte)rnd2[i];
>
>             long rnd = sum1 * sum1 + sum2 * sum2; //approx exponential
>             rnd <<= 30;
>             rnd /= item.Weight;
>             if (rnd < min)
>             {
>                 min = rnd;
>                 selectedItem = item;
>             }
>         }
>
>         if (!selectedItem.disabled)
>             return selectedItem;
>     }
> }
> ( https://github.com/artelk/CRUSH_Selection_Algorithm_Test/blob/master/Program.cs )
>
> With 10 items (with weights of 1 or 2) in the bucket and 10^5 objects, the object distribution is like this:
>
> Weight:1, w:0.0625, diff:-1.38%, count:6164
> Weight:2, w:0.1250, diff:-1.65%, count:12294
> Weight:1, w:0.0625, diff:1.57%, count:6348
> Weight:1, w:0.0625, diff:-0.18%, count:6239
> Weight:2, w:0.1250, diff:0.19%, count:12524
> Weight:2, w:0.1250, diff:1.43%, count:12679
> Weight:2, w:0.1250, diff:0.92%, count:12615
> Weight:2, w:0.1250, diff:0.77%, count:12596
> Weight:1, w:0.0625, diff:0.59%, count:6287
> Weight:2, w:0.1250, diff:-1.97%, count:12254
>
> So the deviation from the expected numbers is quite small. With 10^6 objects the deviation drops below 1%. I've also tried adding more bucket items with different weights and changing the weights of arbitrary items; the number of moved objects is very close to the optimal amount.
> What do you think of that?
>
> Btw, I've tried another algorithm that is O(1). It requires a few seconds of initialization (this can be optimized) and builds an array of several megabytes, which is cached until the next change in the crush map and/or osd weights. Could you tell me whether the selection algorithms (straw/tree/list/...) are called every time an object is read/written, or whether there is an in-memory cached {pg_id -> [osd list]} table that is used for the placement decision?
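
For reference, the standard probability argument behind this kind of draw (not spelled out in the quoted message): if the per-item draws $X_i$ are i.i.d. exponential, dividing by the weight rescales the rate, and the smallest scaled draw wins with probability proportional to the weight:

$$\frac{X_i}{w_i} \sim \mathrm{Exp}(w_i), \qquad P\!\left(\arg\min_i \frac{X_i}{w_i} = j\right) = \frac{w_j}{\sum_i w_i}, \qquad X_i \sim \mathrm{Exp}(1)\ \text{i.i.d.}$$

The code approximates such draws with sums of uniforms: by the central limit theorem $S = \sum_{k=1}^{4} U_k$ is roughly $\mathcal{N}(0, \sigma^2)$, and $S_1^2 + S_2^2$ is then roughly $\sigma^2 \chi^2_2 = \mathrm{Exp}\!\left(\tfrac{1}{2\sigma^2}\right)$. The common factor $2\sigma^2$ cancels in the argmin, so it does not change the selection probabilities.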

Maybe I'm misunderstanding what you're saying here, and it's definitely an interesting topic, but:

The current implementation is pretty heavily optimized. If I understand and remember correctly, we are currently using two small lookup tables to do the calculations. We started out with the sum-of-squares approach (mathematically at least) and used the standard methods to compute it, but it was slower than we wanted.
The full-sized lookup table was the next step (and we could encode it into the source if we wanted), but due to typical cache size tradeoffs the combined lookup table and computation we're doing now is usually faster.

At least, this is how I remember it. Somebody who actually did the work might be able to contradict me. ;)
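
For context, a minimal C# sketch of the straw2-style draw being discussed: map each item's hash to a uniform value, divide ln(u) by the item's weight, and keep the item with the largest draw. It uses Math.Log directly instead of the fixed-point log lookup tables Greg mentions, and the Hash() function below is only a stand-in for CRUSH's real hash, so this is an illustration of the idea rather than the actual implementation.

using System;

class Straw2Sketch
{
    // Straw2-style selection: each item gets draw = ln(u) / weight for a
    // per-item uniform u in (0, 1), and the largest draw wins. This is the
    // same as taking the minimum of -ln(u) / weight, i.e. an (approximately)
    // exponential draw scaled by the weight.
    public static int Select(uint input, uint round, double[] weights)
    {
        int best = -1;
        double bestDraw = double.NegativeInfinity;
        for (int i = 0; i < weights.Length; i++)
        {
            if (weights[i] <= 0)
                continue;
            // Map the low 16 bits of the hash to a uniform value in (0, 1).
            uint h = Hash(input, (uint)i, round);
            double u = (h & 0xffff) / 65536.0 + 1.0 / 131072.0;
            // ln(u) <= 0, so dividing by a larger weight pulls the draw
            // toward 0 and heavier items win proportionally more often.
            double draw = Math.Log(u) / weights[i];
            if (draw > bestDraw)
            {
                bestDraw = draw;
                best = i;
            }
        }
        return best;
    }

    // Placeholder integer mixer; NOT CRUSH's actual hash function.
    static uint Hash(uint a, uint b, uint c)
    {
        uint x = a * 2654435761u ^ b * 2246822519u ^ c * 3266489917u;
        x ^= x >> 15; x *= 2246822519u; x ^= x >> 13;
        return x;
    }

    static void Main()
    {
        // Sanity check: with weights 1 and 2 the second item should be
        // chosen roughly twice as often as the first.
        var weights = new[] { 1.0, 2.0 };
        var counts = new int[2];
        for (uint obj = 0; obj < 100000; obj++)
            counts[Select(obj, 0, weights)]++;
        Console.WriteLine($"item0={counts[0]}, item1={counts[1]}");
    }
}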