Re: [RFC] AutoNUMA alpha6

Ingo Molnar <mingo@xxxxxxxxxx> · Wed, 21 Mar 2012 08:53:49 +0100

* Dan Smith <danms@xxxxxxxxxx> wrote:

> On your numa01 test:
> 
>   Autonuma is 22% faster than mainline
>   Numasched is 42% faster than mainline
> 
> On Peter's modified stream_d test:
> 
>   Autonuma is 35% *slower* than mainline
>   Numasched is 55% faster than mainline
> 
> I know that the "real" performance guys here are going to be 
> posting some numbers from more interesting benchmarks soon, 
> but since nobody had answered Andrea's question, I figured I'd 
> do it.

It would also be nice to find and run *real* HPC workloads that 
were not written by Andrea or Peter and that compute something 
non-trivial and real - and then compare the various methods.

Ideally we'd like to measure the two conceptual working set 
corner cases:

  - global working set HPC with a large shared working set:

      - Many types of Monte-Carlo optimizations tend to be
        like this - they have a large shared time series and
        threads compute on those with comparatively little
        private state.

      - 3D rendering with physical modelling: a large, complex
        3D scene set with private worker threads. (Much of this 
        tends to be done on GPUs these days, though.)

  - private working set HPC with little shared/global working 
    set and lots of per process/thread private memory 
    allocations:

      - Quantum chemistry optimization runs tend to be like this,
        with matrices that are often gigabytes in size.

      - Gas, fluid, solid state and gravitational particle
        simulations - most ab initio methods tend to have very
        little global shared state, each thread iterates its own
        version of the universe.

      - More complex runs of ray tracing as well IIRC.

My impression is that while threading is on the rise due to its 
ease of use, many threaded HPC workloads still fall into the 
second category.

In fact they are often explicitly *turned* into the second 
category at the application level by duplicating shared global 
data and turning it into per-thread local data.
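
To make that duplication pattern concrete, here is a minimal sketch 
in C with pthreads. All names in it (struct worker, worker_main, 
sum_with_private_copies) are hypothetical and not taken from any 
real HPC code. It assumes the default Linux policy of placing a 
page on the node of the CPU that first touches it, which is why the 
copy is allocated and filled from inside each thread. The sketch 
does not pin threads to CPUs; a real run would presumably also bind 
them (e.g. via numactl or sched_setaffinity()). Build with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread descriptor, for illustration only. */
struct worker {
	const double *shared;	/* read-only global input */
	size_t n;		/* number of elements in the input */
	double sum;		/* per-thread result */
	int ok;			/* 1 if the private copy could be allocated */
};

static void *worker_main(void *arg)
{
	struct worker *w = arg;

	/*
	 * Allocate and fill the copy from inside the thread, so that
	 * the pages are first touched by the thread that uses them.
	 * Assumption: first-touch placement; malloc() may still hand
	 * back memory that was touched earlier by another thread.
	 */
	double *priv = malloc(w->n * sizeof(*priv));
	if (!priv) {
		w->ok = 0;
		return NULL;
	}
	memcpy(priv, w->shared, w->n * sizeof(*priv));

	/* From here on the thread only reads its private copy. */
	double sum = 0.0;
	for (size_t i = 0; i < w->n; i++)
		sum += priv[i];

	w->sum = sum;
	w->ok = 1;
	free(priv);
	return NULL;
}

/*
 * Run nthreads workers, each on its own copy of shared[0..n-1].
 * Stores one sum per thread in out[] and returns 0 on success,
 * -1 if a thread could not be created or a copy not allocated.
 */
int sum_with_private_copies(const double *shared, size_t n,
			    int nthreads, double *out)
{
	assert(shared && out && n > 0 && nthreads > 0);

	pthread_t *tids = malloc((size_t)nthreads * sizeof(*tids));
	struct worker *ws = malloc((size_t)nthreads * sizeof(*ws));
	if (!tids || !ws) {
		free(tids);
		free(ws);
		return -1;
	}

	int started = 0, rc = 0;
	for (; started < nthreads; started++) {
		ws[started] = (struct worker){ .shared = shared, .n = n };
		if (pthread_create(&tids[started], NULL, worker_main,
				   &ws[started]) != 0) {
			rc = -1;
			break;
		}
	}

	/* Join only the threads that were actually started. */
	for (int i = 0; i < started; i++) {
		pthread_join(tids[i], NULL);
		if (!ws[i].ok)
			rc = -1;
		out[i] = ws[i].sum;
	}

	free(tids);
	free(ws);
	return rc;
}
```

The trade-off is nthreads times the memory footprint in exchange 
for accesses that can stay node-local, which only makes sense when 
the shared data is read-mostly.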

So we need to cover these major HPC use cases - we won't merge 
any of this based on just synthetic benchmarks.

And to default-enable any of this on stock kernels we'd need 
even more testing and widespread, feel-good speedups in almost 
every key Linux workload... I don't see that happening though, 
so the best we can get are probably some easy and flexible knobs 
for HPC.

Thanks,

	Ingo
