On Sun, Feb 20, 2005 at 11:52:16PM +0000, Adam D. Moss wrote:

> >I can force it to use both CPUs now, but even with
> >200% utilization it is 2s slower to run this stupid
> >ubenchmark than on 1 CPU without threads.
>
> Just a vague guess, but the multiprocessor GIMP pixel
> work scheduler might* farm alternating tiles to alternating
> CPUs. These are reasonably likely to have been allocated
> together and thus sit close together in memory, causing
> memory contention between CPUs.

So it might make sense not to interleave tiles, but to let each thread
start on a different region of the image, like this (for 4 threads):

1xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
2xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
3xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
4xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

That way, each thread works on "its own" memory at first, and only
after it finishes its region does it take tiles from the remaining
work. Some binary-search-like algorithm could be used, like this:

- figure out the largest remaining part of the image
  (at start, it's the whole image)
- let x be the number of idle threads
- if starting up,
    divide into x regions,
    assign each thread the start of its region
  else
    divide into x+1 regions,
    assign the idle threads the (x+1)st, x-th, (x-1)st etc. region
    (this leaves the first region to the thread currently processing it)
- start processing the regions
- if any thread runs out of work, restart the algorithm

This should improve memory locality while distributing work across all
threads and making sure that non-uniform workloads are also
distributed. Consider the above example at the point where thread 4
has finished its tiles:

1111xxxxxxxxxxxx
xxxxxxxxxxxxxxxx
22222222xxxxxxxx
xxxxxxxxxxxxxxxx
333333333333333x
xxxxxxxxxxxxxxxx
4444444444444444
4444444444444444

It looks like 1 will need the longest time to finish its region, so we
jump in there and assign part of its work to 4 (marked y):

1111xxxxxxxxxxxx
xx4yyyyyyyyyyyyy
22222222xxxxxxxx
xxxxxxxxxxxxxxxx
333333333333333x
xxxxxxxxxxxxxxxx
4444444444444444
4444444444444444

Later, 3 finishes:

11111111xxxxxxxx
xx44444yyyyyyyyy
2222222222222222
xxxxxxxxxxxxxxxx
3333333333333333
3333333333333333
4444444444444444
4444444444444444

Now 2 has the largest region left, so half of it gets assigned to 3
(marked z):

11111111xxxxxxxx
xx44444yyyyyyyyy
2222222222222222
xxxxxxxx3zzzzzzz
3333333333333333
3333333333333333
4444444444444444
4444444444444444

The state of affairs is easy to manage, since there are never more
regions than threads.

HTH!

Tino.

PS: I don't know whether this algorithm works well in practice - I
actually got the idea while replying to Adam's response.
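
PPS: Here is a minimal sketch of the idea in C with pthreads, just to
make it concrete. Everything in it is made up for illustration
(region_t, grab_tile, process_tile, the tile numbering) - it is not
GIMP code, only the region-splitting scheme with tiles numbered
row-major and a mutex around the region table:

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_TILES   128  /* image tiles, numbered row-major */

typedef struct
{
  int next;  /* next tile this thread will take      */
  int end;   /* one past the last tile of its region */
} region_t;

static region_t        regions[N_THREADS];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void
process_tile (int tile, int self)
{
  printf ("thread %d: tile %d\n", self, tile);  /* real pixel work here */
}

/* Take the next tile from our own region.  If it is exhausted, find
 * the largest remaining region, split it and take the second half, so
 * the victim keeps working on the memory it already touched.
 * Returns -1 when no work is left anywhere.  */
static int
grab_tile (int self)
{
  int tile = -1;

  pthread_mutex_lock (&lock);
  if (regions[self].next >= regions[self].end)
    {
      int i, victim = -1, biggest = 0;

      for (i = 0; i < N_THREADS; i++)
        if (regions[i].end - regions[i].next > biggest)
          {
            biggest = regions[i].end - regions[i].next;
            victim  = i;
          }

      if (victim >= 0)
        {
          int mid = regions[victim].next + biggest / 2;

          regions[self].next  = mid;
          regions[self].end   = regions[victim].end;
          regions[victim].end = mid;
        }
    }
  if (regions[self].next < regions[self].end)
    tile = regions[self].next++;
  pthread_mutex_unlock (&lock);

  return tile;
}

static void *
worker (void *arg)
{
  int self = (int) (long) arg;
  int tile;

  while ((tile = grab_tile (self)) >= 0)
    process_tile (tile, self);
  return NULL;
}

int
main (void)
{
  pthread_t thread[N_THREADS];
  int i;

  /* initial split: one contiguous region per thread, so each thread
   * starts on "its own" memory as in the diagrams above */
  for (i = 0; i < N_THREADS; i++)
    {
      regions[i].next = i * (N_TILES / N_THREADS);
      regions[i].end  = (i + 1) * (N_TILES / N_THREADS);
    }
  regions[N_THREADS - 1].end = N_TILES;  /* absorb rounding remainder */

  for (i = 0; i < N_THREADS; i++)
    pthread_create (&thread[i], NULL, worker, (void *) (long) i);
  for (i = 0; i < N_THREADS; i++)
    pthread_join (thread[i], NULL);
  return 0;
}

The locality part is that every thread consumes consecutive tiles from
one contiguous region; the balancing part is the split-in-half whenever
a thread runs dry. Taking the *second* half on a steal is what keeps
the victim on the tiles it is already close to.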