Re: NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]

* Mel Gorman <mgorman@xxxxxxx> wrote:

> > NUMA convergence latency measurements
> > -------------------------------------
> > 
> > 'NUMA convergence' latency is the number of seconds a 
> > workload takes to reach a 'perfectly NUMA balanced' state. 
> > This is measured on the CPU placement side: once it has 
> > converged then memory typically follows within a couple of 
> > seconds.
> 
> This is a sort of misleading metric, so be wary of it, as the 
> speed at which a workload converges is not necessarily 
> useful. It only 
> makes a difference for short-lived workloads or during phase 
> changes. If the workload is short-lived, it's not interesting 
> anyway. If the workload is rapidly changing phases then the 
> migration costs can be a major factor and rapidly converging 
> might actually be slower overall.
> 
> The speed the workload converges will depend very heavily on 
> when the PTEs are marked pte_numa and when the faults are 
> incurred. If this is happening very rapidly then a workload 
> will converge quickly *but* this can incur a high system CPU 
> cost (PTE scanning, fault trapping etc).  This metric can be 
> gamed by always scanning rapidly but the overall performance 
> may be worse.
> 
> I'm not saying that this metric is not useful, it is. Just be 
> careful of optimising for it. numacore's system CPU usage has 
> been really high in a number of benchmarks and it may be 
> because you are optimising to minimise time to convergence.

You are missing a big part of the NUMA balancing picture here: 
the primary use of 'latency of convergence' is to determine 
whether a workload converges *at all*.

For example if you look at the 4-process / 8-threads-per-process 
latency results:

                            [ Lower numbers are better. ]
 
  [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
 ------------------------------------------------------------------------------------------
  4x8-convergence        :  101.1 |         101.3 |           3.4 |           3.9 |  secs

You'll see that balancenuma does not converge this workload. 
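
(For completeness: measuring that latency is conceptually simple. 
Below is a minimal, hypothetical sketch - not the harness that 
produced the numbers above - where every worker thread samples the 
NUMA node it currently runs on via libnuma, and the main thread 
reports the elapsed time once all of them have landed on the same 
node. Thread count, sampling interval and the omitted memory 
workload are placeholders.)

  /*
   * Minimal sketch, not the actual test harness.
   * Build with: gcc -O2 -o conv conv.c -lpthread -lnuma
   */
  #define _GNU_SOURCE
  #include <numa.h>
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  #define NR_THREADS 8

  static volatile int thread_node[NR_THREADS];

  static void *worker(void *arg)
  {
          int idx = (int)(long)arg;

          for (;;) {
                  /* record the node this thread is executing on */
                  thread_node[idx] = numa_node_of_cpu(sched_getcpu());
                  usleep(100000);
                  /* ... the real test does its memory work here ... */
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t tid[NR_THREADS];
          struct timespec start, now;
          int i;

          for (i = 0; i < NR_THREADS; i++)
                  thread_node[i] = -1;

          clock_gettime(CLOCK_MONOTONIC, &start);
          for (i = 0; i < NR_THREADS; i++)
                  pthread_create(&tid[i], NULL, worker, (void *)(long)i);

          for (;;) {
                  int node = thread_node[0], same = (node >= 0);

                  for (i = 1; i < NR_THREADS; i++)
                          if (thread_node[i] != node)
                                  same = 0;
                  if (same) {
                          clock_gettime(CLOCK_MONOTONIC, &now);
                          printf("converged on node %d after %ld secs\n",
                                 node, (long)(now.tv_sec - start.tv_sec));
                          break;
                  }
                  sleep(1);
          }
          return 0;
  }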

Where does such a workload matter? For example in the 4x JVM 
SPECjbb tests that Thomas Gleixner has reported today:

    http://lkml.org/lkml/2012/12/10/437

There balancenuma does worse than AutoNUMA and the -v3 tree 
exactly because it does not NUMA-converge as well (or at all).

> I'm trying to understand what you're measuring a bit better.  
> Take 1x4 for example -- one process, 4 threads. If I'm reading 
> this description then all 4 threads use the same memory. Is 
> this correct? If so, this is basically a variation of numa01 
> which is an adverse workload. [...]

No, 1x4 and 1x8 are like the SPECjbb JVM tests you have been 
performing - not an 'adverse' workload. The threads of the JVM 
are sharing memory significantly enough to justify moving them 
onto the same node.
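
(Purely illustrative, not the actual 1x4 test: the access pattern 
meant here is roughly the sketch below, where all threads of one 
process sweep the same shared buffer, so both the threads and the 
buffer belong on one node. Sizes and iteration counts are made up.)

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NR_THREADS  4
  #define BUF_LONGS   (256UL * 1024 * 1024 / sizeof(long))  /* 256 MB */

  static long *shared_buf;

  static void *worker(void *arg)
  {
          unsigned long sum = 0, i, pass;

          (void)arg;
          /* every thread sweeps the whole shared buffer repeatedly */
          for (pass = 0; pass < 100; pass++)
                  for (i = 0; i < BUF_LONGS; i++)
                          sum += shared_buf[i];
          return (void *)sum;
  }

  int main(void)
  {
          pthread_t tid[NR_THREADS];
          int i;

          shared_buf = calloc(BUF_LONGS, sizeof(long));
          for (i = 0; i < NR_THREADS; i++)
                  pthread_create(&tid[i], NULL, worker, NULL);
          for (i = 0; i < NR_THREADS; i++)
                  pthread_join(tid[i], NULL);
          printf("done\n");
          return 0;
  }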

> [...]  balancenuma will not migrate memory in this case as 
> it'll never get past the two-stage filter. If there are few 
> threads, it might never get scheduled on a new node in which 
> case it'll also do nothing.
> 
> The correct action in this case is to interleave memory and 
> spread the tasks between nodes but it lacks the information to 
> do that. [...]

No, the correct action is to move related threads close to each 
other.

> [...] This was deliberate as I was expecting numacore or 
> autonuma to be rebased on top and I didn't want to collide.
> 
> Does the memory requirement of all threads fit in a single 
> node? This is related to my second question -- how do you 
> define convergence?

NUMA-convergence means that tasks have reached their ideal CPU 
and memory placement.
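
(The memory side of this can be verified from user-space as well. 
A hedged sketch, again not the actual measurement code: move_pages(2) 
with a NULL 'nodes' argument reports, without migrating anything, 
which node each page of a buffer currently sits on. The buffer size 
and the node 0 summary below are arbitrary.)

  /*
   * Sketch only: query current page placement of a buffer via
   * move_pages(2) with nodes == NULL (query, do not migrate).
   * Build with: gcc -O2 -o placement placement.c -lnuma
   */
  #define _GNU_SOURCE
  #include <numaif.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          long page_size = sysconf(_SC_PAGESIZE);
          size_t len = 64UL * 1024 * 1024;              /* 64 MB example */
          size_t nr_pages = len / page_size;
          char *buf = malloc(len);
          void **pages = malloc(nr_pages * sizeof(void *));
          int *status = malloc(nr_pages * sizeof(int));
          size_t i;
          long on_node0 = 0;

          memset(buf, 1, len);                          /* fault pages in */
          for (i = 0; i < nr_pages; i++)
                  pages[i] = buf + i * page_size;

          /* nodes == NULL: report placement, do not migrate */
          if (move_pages(0, nr_pages, pages, NULL, status, 0) == 0) {
                  for (i = 0; i < nr_pages; i++)
                          if (status[i] == 0)
                                  on_node0++;
                  printf("%ld of %zu pages on node 0\n", on_node0, nr_pages);
          }
          return 0;
  }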

> > The 'balancenuma' kernel does not converge any of the 
> > workloads where worker threads or processes relate to each 
> > other.
> 
> I'd like to know if it is because the workload fits on one 
> node. If the buffers are all really small, balancenuma would 
> have skipped them entirely for example due to this check
> 
>         /* Skip small VMAs. They are not likely to be of relevance */
>         if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
>                 continue;

No, the memory areas are larger than 2MB.

> Another possible explanation is that in the 4x4 case the 
> process's threads are getting scheduled on separate nodes. As 
> each thread is sharing data it would not get past the 
> two-stage filter.
> 
> How realistic is it that threads are accessing the same data? 

In practice? Very ...

> That looks like it would be a bad idea even from a caching 
> perspective if the data is being updated. I would expect that 
> the majority of HPC workloads would have each thread accessing 
> mostly private data until the final stages where the results 
> are aggregated together.

You tested such a workload many times in the past: the 4x JVM 
SPECjbb test ...

> > NUMA workload bandwidth measurements
> > ------------------------------------
> > 
> > The other set of numbers I've collected is workload 
> > bandwidth measurements, run over 20 seconds. Using 20 
> > seconds gives a healthy mix of pre-convergence and 
> > post-convergence bandwidth,
> 
> 20 seconds is *really* short. That might not even be enough 
> time for autonuma's knumad thread to find the process and 
> update it as IIRC it starts pretty slowly.

If you check the convergence latency tables you'll see that 
AutoNUMA is able to converge within 20 seconds.

Thanks,

	Ingo
