On Wed, Jul 31, 2013 at 12:48:14PM +0200, Peter Zijlstra wrote: > On Wed, Jul 31, 2013 at 11:30:52AM +0100, Mel Gorman wrote: > > I'm not sure I understand your point. The scan rate is decreased again if > > the page is found to be properly placed in the future. It's in the next > > hunk you modify although the periodically reset comment is now out of date. > > Yeah its because of the next hunk. I figured that if we don't lower it, > we shouldn't raise it either. > hmm, I'm going to punt that to a TODO item and think about it some more with a fresh head. > > > @@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in > > > /* > > > * If pages are properly placed (did not migrate) then scan slower. > > > * This is reset periodically in case of phase changes > > > - */ > > > - if (!migrated) > > > + * > > > + * APZ: it seems to me that one can get a ton of !migrated faults; > > > + * consider the scenario where two threads fight over a shared memory > > > + * segment. We'll win half the faults, half of that will be local, half > > > + * of that will be remote. This means we'll see 1/4-th of the total > > > + * memory being !migrated. Using a fixed increment will completely > > > + * flatten the scan speed for a sufficiently large workload. Another > > > + * scenario is due to that migration rate limit. > > > + * > > > + if (!migrated) { > > > p->numa_scan_period = min(p->numa_scan_period_max, > > > p->numa_scan_period + jiffies_to_msecs(10)); > > > + } > > > + */ > > > > FWIW, I'm also not happy with how the scan rate is reduced but did not > > come up with a better alternative that was not fragile or depended on > > gathering too much state. Granted, I also have not been treating it as a > > high priority problem. > > Right, so what Ingo did is have the scan rate depend on the convergence. > What exactly did you dislike about that? > It depended entirely on properly detecting if we are converged or not. As things like false share detection within THP is still not there I was worried that it was too easy to make the wrong decision here and keep it pinned at the maximum scan rate. > We could define the convergence as all the faults inside the interleave > mask vs the total faults, and then run at: min + (1 - c)*(max-min). > And when we have such things properly in place then I think we can kick away the current crutch. > > > +#if 0 > > > /* > > > * We do not care about task placement until a task runs on a node > > > * other than the first one used by the address space. This is > > > * largely because migrations are driven by what CPU the task > > > * is running on. If it's never scheduled on another node, it'll > > > * not migrate so why bother trapping the fault. > > > + * > > > + * APZ: seems like a bad idea for pure shared memory workloads. > > > */ > > > if (mm->first_nid == NUMA_PTE_SCAN_INIT) > > > mm->first_nid = numa_node_id(); > > > > At some point in the past scan starts were based on waiting a fixed interval > > but that seemed like a hack designed to get around hurting kernel compile > > benchmarks. I'll give it more thought and see can I think of a better > > alternative that is based on an event but not this event. > > Ah, well the reasoning on that was that all this NUMA business is > 'expensive' so we'd better only bother with tasks that persist long > enough for it to pay off. > Which is fair enough but tasks that lasted *just* longer than the interval still got punished. Processes running with a slightly slower CPU gets hurts meaning that it would be a difficult bug report to digest. > In that regard it makes perfect sense to wait a fixed amount of runtime > before we start scanning. > > So it was not a pure hack to make kbuild work again.. that is did was > good though. > Maybe we should reintroduce the delay then but I really would prefer that it was triggered on some sort of event. > > > @@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head > > > * Do not set pte_numa if the current running node is rate-limited. > > > * This loses statistics on the fault but if we are unwilling to > > > * migrate to this node, it is less likely we can do useful work > > > - */ > > > + * > > > + * APZ: seems like a bad idea; even if this node can't migrate anymore > > > + * other nodes might and we want up-to-date information to do balance > > > + * decisions. > > > + * > > > if (migrate_ratelimited(numa_node_id())) > > > return; > > > + */ > > > > > > > Ingo also disliked this but I wanted to avoid a situation where the > > workload suffered because of a corner case where the interconnect was > > filled with migration traffic. > > Right, but you already rate limit the actual migrations, this should > leave enough bandwidth to allow the non-migrating scanning. > > I think its important we keep up-to-date information if we're going to > do placement based on it. > Ok, you convinced me. I slapped a changelog on it that is a cut&paste job and moved it earlier in the series. > On that rate-limit, this looks to be a hard-coded number unrelated to > the actual hardware. Guesstimate. > I think we should at the very least make it a > configurable number and preferably scale the number with the SLIT info. > Or alternatively actually measure the node to node bandwidth. > Ideally we should just kick it away because scan rate limiting works properly. Lets not make it a tunable just yet so we can avoid having to deprecate it later. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>