On Sun, Feb 20, 2005 at 11:52:16PM +0000, Adam D. Moss wrote:

> >I can force it to use both CPUs now, but even with
> >200% utilization it is 2s slower to run this stupid
> >ubenchmark than on 1 CPU without threads.
>
> Just a vague guess, but the multiprocessor GIMP pixel
> work scheduler might* farm alternating tiles to alternating
> CPUs. These are reasonably likely to have been allocated
> together and thus sit close together in memory, causing
> memory contention between CPUs.

So it might make sense not to interleave tiles, but to let each thread
start on a different region of the image, like this (for 4 threads):

1xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
2xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
3xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
4xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

That way, each thread works on "its own" memory at first, and only
after it finishes its region does it take tiles from the remaining
work. Some binary-search-like algorithm could be used, like this:

- figure out the largest remaining part of the image
  (at start, it's the whole image)
- let x be the number of idle threads
- if starting up,
    divide into x regions,
    assign each thread the start of its region
  else
    divide into x+1 regions,
    assign the idle threads the (x+1)st, x-th, (x-1)st etc. region
    (this leaves the first region to the thread currently processing it)
- start processing the regions
- if any thread runs out of work, restart the algorithm

This should improve memory locality while distributing work across all
threads and making sure that non-uniform workloads are also
distributed. Consider the above example at the point where thread 4
has finished its tiles:

1111xxxxxxxxxxxx
xxxxxxxxxxxxxxxx
22222222xxxxxxxx
xxxxxxxxxxxxxxxx
333333333333333x
xxxxxxxxxxxxxxxx
4444444444444444
4444444444444444

It looks like 1 will need the longest time to finish its region, so we
jump in there and assign part of its work to 4 (marked y):

1111xxxxxxxxxxxx
xx4yyyyyyyyyyyyy
22222222xxxxxxxx
xxxxxxxxxxxxxxxx
333333333333333x
xxxxxxxxxxxxxxxx
4444444444444444
4444444444444444

Later, 3 finishes:

11111111xxxxxxxx
xx44444yyyyyyyyy
2222222222222222
xxxxxxxxxxxxxxxx
3333333333333333
3333333333333333
4444444444444444
4444444444444444

Now 2 has the largest region left, so half of it gets assigned to 3
(marked z):

11111111xxxxxxxx
xx44444yyyyyyyyy
2222222222222222
xxxxxxxx3zzzzzzz
3333333333333333
3333333333333333
4444444444444444
4444444444444444

The state of affairs is easy to manage, since there are never more
regions than threads.

HTH!

Tino.

PS: I don't know whether this algorithm works well in practice - I
actually got the idea while replying to Adam's response.
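
PPS: Here is a minimal sketch of the idea in C with pthreads, just to
make it concrete. Everything in it is made up for illustration
(region_t, grab_tile, process_tile, the tile numbering) - it is not
GIMP code, only the region-splitting scheme with tiles numbered
row-major and a mutex around the region table:

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_TILES   128  /* image tiles, numbered row-major */

typedef struct
{
  int next;  /* next tile this thread will take      */
  int end;   /* one past the last tile of its region */
} region_t;

static region_t        regions[N_THREADS];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void
process_tile (int tile, int self)
{
  printf ("thread %d: tile %d\n", self, tile);  /* real pixel work here */
}

/* Take the next tile from our own region.  If it is exhausted, find
 * the largest remaining region, split it and take the second half, so
 * the victim keeps working on the memory it already touched.
 * Returns -1 when no work is left anywhere.  */
static int
grab_tile (int self)
{
  int tile = -1;

  pthread_mutex_lock (&lock);
  if (regions[self].next >= regions[self].end)
    {
      int i, victim = -1, biggest = 0;

      for (i = 0; i < N_THREADS; i++)
        if (regions[i].end - regions[i].next > biggest)
          {
            biggest = regions[i].end - regions[i].next;
            victim  = i;
          }

      if (victim >= 0)
        {
          int mid = regions[victim].next + biggest / 2;

          regions[self].next  = mid;
          regions[self].end   = regions[victim].end;
          regions[victim].end = mid;
        }
    }
  if (regions[self].next < regions[self].end)
    tile = regions[self].next++;
  pthread_mutex_unlock (&lock);

  return tile;
}

static void *
worker (void *arg)
{
  int self = (int) (long) arg;
  int tile;

  while ((tile = grab_tile (self)) >= 0)
    process_tile (tile, self);
  return NULL;
}

int
main (void)
{
  pthread_t thread[N_THREADS];
  int i;

  /* initial split: one contiguous region per thread, so each thread
   * starts on "its own" memory as in the diagrams above */
  for (i = 0; i < N_THREADS; i++)
    {
      regions[i].next = i * (N_TILES / N_THREADS);
      regions[i].end  = (i + 1) * (N_TILES / N_THREADS);
    }
  regions[N_THREADS - 1].end = N_TILES;  /* absorb rounding remainder */

  for (i = 0; i < N_THREADS; i++)
    pthread_create (&thread[i], NULL, worker, (void *) (long) i);
  for (i = 0; i < N_THREADS; i++)
    pthread_join (thread[i], NULL);
  return 0;
}

The locality part is that every thread consumes consecutive tiles from
one contiguous region; the balancing part is the split-in-half whenever
a thread runs dry. Taking the *second* half on a steal is what keeps
the victim on the tiles it is already close to.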