* Mel Gorman <mgorman@xxxxxxx> wrote:

> While it is desirable that all threads in a process run on its home
> node, this is not always possible or necessary. There may be more
> threads than exist within the node or the node might be over-subscribed
> with unrelated processes.
>
> This can cause a situation whereby a page gets migrated off its home
> node because the threads clearing pte_numa were running off-node. This
> patch uses page->last_nid to build a two-stage filter before pages get
> migrated to avoid problems with short or unlikely task<->node
> relationships.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
> ---
>  mm/mempolicy.c | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4c1c8d8..fd20e28 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2317,9 +2317,37 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  	}
>
>  	/* Migrate the page towards the node whose CPU is referencing it */
> -	if (pol->flags & MPOL_F_MORON)
> +	if (pol->flags & MPOL_F_MORON) {
> +		int last_nid;
> +
>  		polnid = numa_node_id();
>
> +		/*
> +		 * Multi-stage node selection is used in conjunction
> +		 * with a periodic migration fault to build a temporal
> +		 * task<->page relation. By using a two-stage filter we
> +		 * remove short/unlikely relations.
> +		 *
> +		 * Using P(p) ~ n_p / n_t as per frequentist
> +		 * probability, we can equate a task's usage of a
> +		 * particular page (n_p) per total usage of this
> +		 * page (n_t) (in a given time-span) to a probability.
> +		 *
> +		 * Our periodic faults will sample this probability and
> +		 * getting the same result twice in a row, given these
> +		 * samples are fully independent, is then given by
> +		 * P(n)^2, provided our sample period is sufficiently
> +		 * short compared to the usage pattern.
> +		 *
> +		 * This quadric squishes small probabilities, making
> +		 * it less likely we act on an unlikely task<->page
> +		 * relation.
> +		 */
> +		last_nid = page_xchg_last_nid(page, polnid);
> +		if (last_nid != polnid)
> +			goto out;
> +	}
> +
>  	if (curnid != polnid)
>  		ret = polnid;
>  out:

As mentioned in my other mail, this patch of yours looks very similar
to the numa/core commit attached below, mostly written by Peter:

  30f93abc6cb3 sched, numa, mm: Add the scanning page fault machinery

Thanks,

	Ingo

--------------------->
>From 30f93abc6cb3fd387a134d6b94ff5ac396be1c88 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Tue, 13 Nov 2012 12:58:32 +0100
Subject: [PATCH] sched, numa, mm: Add the scanning page fault machinery

Add the NUMA working set scanning/hinting page fault machinery,
with no policy yet.

[ The earliest versions had the mpol_misplaced() function from
  Lee Schermerhorn - this was heavily modified later on. ]

Also-written-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
[ split it out of the main policy patch - as suggested by Mel Gorman ]
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
---
 include/linux/init_task.h |   8 +++
 include/linux/mempolicy.h |   6 +-
 include/linux/mm_types.h  |   4 ++
 include/linux/sched.h     |  41 ++++++++++++--
 init/Kconfig              |  73 +++++++++++++++++++-----
 kernel/sched/core.c       |  15 +++++
 kernel/sysctl.c           |  31 ++++++++++-
 mm/huge_memory.c          |   1 +
 mm/mempolicy.c            | 137 ++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 294 insertions(+), 22 deletions(-)

[...]
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..318043a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2175,6 +2175,143 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }

+/*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By
+ * using a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a task's usage of a particular page (n_p) per total usage
+ * of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability and getting
+ * the same result twice in a row, given these samples are fully
+ * independent, is then given by P(n)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadric squishes small probabilities, making it less likely
+ * we act on an unlikely task<->page relation.
+ *
+ * Return the best node ID this page should be on, or -1 if it should
+ * stay where it is.
+ */
+static int
+numa_migration_target(struct page *page, int page_nid,
+		      struct task_struct *p, int this_cpu,
+		      int cpu_last_access)
+{
+	int nid_last_access;
+	int this_nid;
+
+	if (task_numa_shared(p) < 0)
+		return -1;
+
+	/*
+	 * Possibly migrate towards the current node, depends on
+	 * task_numa_placement() and access details.
+	 */
+	nid_last_access = cpu_to_node(cpu_last_access);
+	this_nid = cpu_to_node(this_cpu);
+
+	if (nid_last_access != this_nid) {
+		/*
+		 * 'Access miss': the page got last accessed from a
+		 * remote node.
+		 */
+		return -1;
+	}
+
+	/*
+	 * 'Access hit': the page got last accessed from our node.
+	 *
+	 * Migrate the page if needed.
+	 */
+
+	/* The page is already on this node: */
+	if (page_nid == this_nid)
+		return -1;
+
+	return this_nid;
+}

[...]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.