On Mon, 21 Feb 2005 02:00:39 +0100, Sven Neumann <sven@xxxxxxxx> wrote:
> > It sounds like the granularity of parallelism is too fine. That is,
> > each "task" is too short and the overhead of task dispatching (your
> > task queue processing, the kernel's thread context switching, any IPC
> > required, etc.) is longer than the duration of a single task.
>
> The task is not a single pixel but a single tile (that is usually a
> region of 64x64 pixels). GIMP processes pixel regions by iterating
> over the tiles. The multi-threaded pixel processor uses a configurable
> number of threads. Each thread obtains a lock on the pixel-region,
> takes a pointer to the next tile from the queue, releases the lock,
> processes the tile and starts over.

I maintain a threaded image processing library called VIPS.

http://www.vips.ecs.soton.ac.uk/

We looked at granularity a while ago, and the 'sweet spot' at which the
thread start/stop time became insignificant seemed to be around 50x50
pixels. I've pasted some numbers at the end of this mail in case anyone
is interested.

I realise GIMP is using a very different evaluation strategy, but the
point (maybe) is that thread manipulation is rather quick, so thread
overhead is probably not what you're seeing with 64x64 pixel tiles.

FWIW, vips works by having a thread pool (rather than a tile queue) and
a simple for(;;) loop over tiles. At each tile, the for() loop waits for
a thread to become free, then assigns it a tile to work on.

The benchmark is a 45 degree rotate of a 4000 by 4000 pixel image. Look
for the point at which real time stops falling. The first two numeric
arguments to "try" are the tile size. The more recent numbers are really
too small to be accurate :-( but the benchmark is 10 years old and took
minutes back then: oh well. I'm supposed to be getting a quad Opteron
soon, which will be interesting. Kernel 2.6 would help too, no doubt.
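For anyone who wants to see the shape of that dispatch loop, here is a rough sketch in Python. It is not the vips source: process_tiles, tiles_for and the Worker class are names made up for illustration, and the per-worker inbox is just one assumed way of "assigning a tile" to a free thread. Python's GIL also means this shows the structure only, not a real speed-up.

```python
import queue
import threading


def tiles_for(width, height, size):
    """Hypothetical helper: split an image into (x, y, w, h) tiles."""
    return [(x, y, min(size, width - x), min(size, height - y))
            for y in range(0, height, size)
            for x in range(0, width, size)]


def process_tiles(tiles, work, n_threads=2):
    """Loop over tiles; at each tile wait for a free thread, then hand it the tile."""
    free = queue.Queue()              # workers that are currently idle
    results = {}
    results_lock = threading.Lock()

    class Worker(threading.Thread):
        def __init__(self):
            super().__init__(daemon=True)
            self.inbox = queue.Queue(maxsize=1)

        def run(self):
            while True:
                tile = self.inbox.get()
                if tile is None:      # shutdown signal
                    return
                out = work(tile)
                with results_lock:
                    results[tile] = out
                free.put(self)        # announce that this thread is idle again

    workers = [Worker() for _ in range(n_threads)]
    for w in workers:
        w.start()
        free.put(w)

    # The dispatch loop: block until some thread is free, then assign a tile.
    for tile in tiles:
        free.get().inbox.put(tile)

    # Each worker reappears in `free` once after its last tile; stop it then.
    for _ in workers:
        free.get().inbox.put(None)
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    tiles = tiles_for(4000, 4000, 50)
    done = process_tiles(tiles, lambda t: t[2] * t[3], n_threads=2)
    print(len(done), "tiles processed")
```

The point of the design is that the only synchronisation per tile is one wait on the "free" queue, so the cost per tile is small and fixed, and it is the tile size that decides whether that cost matters.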
cima: 1 cpu ultrasparc

./try huysum.hr.v fred.v 10 10 1 20    real 30.1    user 24.5
./try huysum.hr.v fred.v 20 20 1 20    real 19.2    user 16.9
./try huysum.hr.v fred.v 30 30 1 20    real 17.8    user 15.4
./try huysum.hr.v fred.v 40 40 1 20    real 17.1    user 15.1
./try huysum.hr.v fred.v 50 50 1 20    real 16.9    user 15.1
./try huysum.hr.v fred.v 60 60 1 20    real 16.6    user 15.0
./try huysum.hr.v fred.v 70 70 1 20    real 17.2    user 15.2
./try huysum.hr.v fred.v 80 80 1 20    real 17.3    user 15.1
./try huysum.hr.v fred.v 90 90 1 20    real 17.4    user 15.3

perugino: 2 cpu supersparc

./try huysum.hr.v fred.v 10 10 1 20    real 0m51.123s    user 1m7.623s
./try huysum.hr.v fred.v 20 20 1 20    real 0m24.601s    user 0m41.133s
./try huysum.hr.v fred.v 30 30 1 20    real 0m21.931s    user 0m38.393s
./try huysum.hr.v fred.v 40 40 1 20    real 0m20.208s    user 0m35.653s
./try huysum.hr.v fred.v 50 50 1 20    real 0m20.109s    user 0m35.283s
./try huysum.hr.v fred.v 60 60 1 20    real 0m19.501s    user 0m34.513s
./try huysum.hr.v fred.v 70 70 1 20    real 0m20.435s    user 0m34.813s
./try huysum.hr.v fred.v 80 80 1 20    real 0m20.558s    user 0m35.293s
./try huysum.hr.v fred.v 90 90 1 20    real 0m20.785s    user 0m35.313s

Run on furini, 2 CPU 450MHz PII Xeon, kernel 2.4.4, vips-7.7.19, gcc 2.95.3

./try huysum.hr.v fred.v 10 10 1 20    real 0m4.542s    user 0m4.350s    sys 0m3.800s
./try huysum.hr.v fred.v 20 20 1 20    real 0m2.206s    user 0m2.750s    sys 0m1.250s
./try huysum.hr.v fred.v 30 30 1 20    real 0m1.678s    user 0m2.610s    sys 0m0.580s
./try huysum.hr.v fred.v 40 40 1 20    real 0m1.483s    user 0m2.460s    sys 0m0.410s
./try huysum.hr.v fred.v 50 50 1 20    real 0m1.443s    user 0m2.330s    sys 0m0.350s
./try huysum.hr.v fred.v 60 60 1 20    real 0m1.385s    user 0m2.390s    sys 0m0.220s
./try huysum.hr.v fred.v 70 70 1 20    real 0m1.394s    user 0m2.460s    sys 0m0.150s
./try huysum.hr.v fred.v 80 80 1 20    real 0m1.365s    user 0m2.360s    sys 0m0.200s
./try huysum.hr.v fred.v 90 90 1 20    real 0m1.393s    user 0m2.450s    sys 0m0.180s

Run on manet, 2 CPU 2.5GHz P4 Xeon, kernel 2.4.18, vips-7.8.5, gcc 2.95.3

./try huysum.hr.v fred.v 10 10 1 20    real 0m1.582s    user 0m1.640s    sys 0m1.470s
./try huysum.hr.v fred.v 20 20 1 20    real 0m0.691s    user 0m0.970s    sys 0m0.410s
./try huysum.hr.v fred.v 30 30 1 20    real 0m0.548s    user 0m0.790s    sys 0m0.230s
./try huysum.hr.v fred.v 40 40 1 20    real 0m0.489s    user 0m0.790s    sys 0m0.160s
./try huysum.hr.v fred.v 50 50 1 20    real 0m0.465s    user 0m0.610s    sys 0m0.180s
./try huysum.hr.v fred.v 60 60 1 20    real 0m0.454s    user 0m0.740s    sys 0m0.030s
./try huysum.hr.v fred.v 70 70 1 20    real 0m0.505s    user 0m0.820s    sys 0m0.120s
./try huysum.hr.v fred.v 80 80 1 20    real 0m0.479s    user 0m0.840s    sys 0m0.090s
./try huysum.hr.v fred.v 90 90 1 20    real 0m0.436s    user 0m0.650s    sys 0m0.040s

Run on constable, 2 CPU 2.5GHz P4 Xeon, kernel 2.4.21, vips-7.10.8, gcc 3.3.1

./try huysum.hr.v fred.v 10 10 1 20    real 0m1.544s    user 0m1.420s    sys 0m1.422s
./try huysum.hr.v fred.v 20 20 1 20    real 0m0.690s    user 0m0.834s    sys 0m0.441s
./try huysum.hr.v fred.v 30 30 1 20    real 0m0.494s    user 0m0.658s    sys 0m0.244s
./try huysum.hr.v fred.v 40 40 1 20    real 0m0.450s    user 0m0.657s    sys 0m0.174s
./try huysum.hr.v fred.v 50 50 1 20    real 0m0.397s    user 0m0.579s    sys 0m0.144s
./try huysum.hr.v fred.v 60 60 1 20    real 0m0.507s    user 0m0.813s    sys 0m0.123s
./try huysum.hr.v fred.v 70 70 1 20    real 0m0.381s    user 0m0.573s    sys 0m0.115s
./try huysum.hr.v fred.v 80 80 1 20    real 0m0.357s    user 0m0.530s    sys 0m0.101s
./try huysum.hr.v fred.v 90 90 1 20    real 0m0.528s    user 0m0.877s    sys 0m0.103s
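
If you would rather not eyeball "the point at which real time stops falling", one way to read it off is sketched below in Python. The rule used (the smallest tile size whose real time is within some tolerance of the best run) and the 5% default are assumptions chosen for illustration, not part of the original benchmark; sweet_spot is a made-up name. The data is the furini "real" column from the listing above, in seconds.

```python
def sweet_spot(timings, tolerance=0.05):
    """Return the smallest tile size whose real time is within
    `tolerance` (as a fraction) of the fastest run."""
    best = min(real for _, real in timings)
    for size, real in sorted(timings):
        if real <= best * (1.0 + tolerance):
            return size


# (tile size, real seconds) for furini, copied from the listing above.
FURINI = [(10, 4.542), (20, 2.206), (30, 1.678), (40, 1.483), (50, 1.443),
          (60, 1.385), (70, 1.394), (80, 1.365), (90, 1.393)]

if __name__ == "__main__":
    print(sweet_spot(FURINI))          # 5% tolerance
    print(sweet_spot(FURINI, 0.10))    # looser 10% tolerance
```

With the 5% tolerance this picks 60 for the furini figures, and 40 with 10%, so the answer moves with the threshold. On the two faster machines the runs are short enough that the noise is larger than any sensible tolerance.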