[ Apologies for replying from a different address; we have a service outage at work. ]
On Fri, Jul 18, 2014 at 1:11 AM, Lai Jiangshan <laijs@xxxxxxxxxxxxxx> wrote:
>
> Hi,
Thank you for your response!
> I'm curious about what will happen when alloc_pages_node(memoryless_node) is called.
alloc_pages_node() is only involved in one of the possible paths (maybe this occurs on x86, where THREAD_SIZE > PAGE_SIZE?). On powerpc, though, that's not the case.
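For reference, here is roughly how the two alloc_thread_info_node() branches looked in kernel/fork.c around that time (paraphrased, not quoted verbatim):

    /* Condensed sketch of kernel/fork.c (~3.16); details may differ. */
    #if THREAD_SIZE >= PAGE_SIZE
    static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                      int node)
    {
            /* Page-allocator path: goes through the node's zonelist,
             * so it falls back cleanly for memoryless nodes. */
            struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
                                                      THREAD_SIZE_ORDER);
            return page ? page_address(page) : NULL;
    }
    #else
    static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                      int node)
    {
            /* Slab path: subject to the node_match() problem walked
             * through below. */
            return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP,
                                         node);
    }
    #endif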
Details:
1. kthread_create_on_node() is called with the node in question (for workqueues, pool->node).
2. kthread_create_on_node() sets up a struct kthread_create_info with create->node = node and wakes up kthreadd.
3. kthreadd runs create_kthread(), which sets current->pref_node_fork = create->node and calls kernel_thread(), which leads through copy_process() to dup_task_struct().
4. dup_task_struct() calls node = tsk_fork_get_node() before invoking alloc_task_struct_node(node) and alloc_thread_info_node(node).
5. tsk_fork_get_node() returns current->pref_node_fork for kthreadd.
6. alloc_task_struct_node() calls kmem_cache_alloc_node(..., GFP_KERNEL, node).
7. alloc_thread_info_node() either calls kmem_cache_alloc_node(..., GFP_KERNEL, node) or alloc_kmem_pages_node(node, GFP_KERNEL, ...), depending on THREAD_SIZE relative to PAGE_SIZE.
8a. alloc_kmem_pages_node() -> alloc_pages_node() -> __alloc_pages(), with a zonelist built by node_zonelist(). This should lead to proper fallback.
8b. kmem_cache_alloc_node() calls slab_alloc_node().
9. For a memoryless node, slab_alloc_node() will trigger the following:
    if (unlikely(!object || !node_match(page, node))) {
            object = __slab_alloc(s, gfpflags, node, addr, c);
            stat(s, ALLOC_SLOWPATH);
    }
10. __slab_alloc() in turn will:
    if (unlikely(!node_match(page, node))) {
            stat(s, ALLOC_NODE_MISMATCH);
            deactivate_slab(s, page, c->freelist);
            c->page = NULL;
            c->freelist = NULL;
            goto new_slab;
    }
deactivating the slab. Since no page can ever belong to a memoryless node, node_match() fails on every allocation, so every kthread created with a node specification leads to a single object on its own slab. We see an explosion in slab consumption, all of which is unreclaimable. The sketch below contrasts the two allocation paths.
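To make that contrast concrete, here is a rough sketch (illustrative only, not verbatim kernel code; node_zonelist(), __alloc_pages(), node_match(), and deactivate_slab() are the real identifiers from the paths above, the scaffolding around them is mine):

    /*
     * Page-allocator path (8a): the node is mapped to a zonelist,
     * which already encodes fallback order, so a memoryless node
     * silently borrows from its nearest neighbor with memory.
     */
    struct zonelist *zonelist = node_zonelist(node, GFP_KERNEL);
    page = __alloc_pages(GFP_KERNEL, order, zonelist);

    /*
     * Slab path (8b-10): the node is compared exactly against the
     * node of the current cpu slab's page.  A memoryless node owns
     * no pages, so the match fails every time, and each allocation
     * deactivates the cpu slab and starts a fresh one.
     */
    if (!node_match(page, node)) {          /* always fails here */
            deactivate_slab(s, page, c->freelist);
            /* ... allocate a brand-new slab for a single object ... */
    }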
Anton originally proposed not deactivating slabs when we *know* the allocation will be remote (i.e., when the requested node is memoryless). Joonsoo and Christoph disagreed with this and proposed alternative solutions, which weren't agreed upon at the time.
> If the memory is allocated from the most preferable node for the @memoryless_node,
> why do we need to bother and use cpu_to_mem() at the caller site?
The reason is that the node passed is a hint to the MM subsystem about which node we want memory to come from. Well, I take that back; I think semantically there are two ways to interpret the node parameter:
1) The NUMA node we want memory from
2) The NUMA node we expect memory from
The path through the MM above sort of conflates the two: the caller specified an impossible request (which the MM subsystem technically knows, but isn't using that knowledge at this point) for memory from a node that has none.
We could change the core MM to do better in the presence of memoryless nodes, and we should, but this approach seems far less invasive and does the right thing. Semantically, I think the workqueue's pool->node is meant to be the node from which we want memory allocated, which is the node with memory closest to the CPU.
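As a toy illustration of that distinction (the topology here is hypothetical: CPU 4 sits on memoryless node 1, whose nearest memory is on node 0):

    #include <linux/topology.h>

    /* Hypothetical topology: CPU 4 lives on node 1, which has no
     * memory; node 0 is the nearest node that does. */
    int nid = cpu_to_node(4);   /* 1: the CPU's own (memoryless) node */
    int mem = cpu_to_mem(4);    /* 0: nearest node with memory */

    /* Passing nid into kmem_cache_alloc_node() triggers the slab
     * churn above; passing mem (as this patch does for pool->node)
     * is always satisfiable. */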
Thanks,
Nish
> If not, why does the memory allocation subsystem refuse to find a preferable node
> for @memoryless_node in this case? Is that intentional, or can it
> not find one in some cases?
>
> Thanks,
> Lai
>
> Added CC to Tejun (workqueue maintainer).
>
> On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote:
> > In the presence of memoryless nodes, the workqueue code incorrectly uses
> > cpu_to_node() to determine what node to prefer memory allocations come
> > from. cpu_to_mem() should be used instead, which will use the nearest
> > NUMA node with memory.
> >
> > Signed-off-by: Nishanth Aravamudan <nacc@xxxxxxxxxxxxxxxxxx>
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 35974ac..0bba022 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
> > for_each_node(node) {
> > if (cpumask_subset(pool->attrs->cpumask,
> > wq_numa_possible_cpumask[node])) {
> > - pool->node = node;
> > + /*
> > + * We could use local_memory_node(node) here,
> > + * but it is expensive and the following caches
> > + * the same value.
> > + */
> > + pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
> > break;
> > }
> > }
> > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
> > pool->cpu = cpu;
> > cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
> > pool->attrs->nice = std_nice[i++];
> > - pool->node = cpu_to_node(cpu);
> > + pool->node = cpu_to_mem(cpu);
> >
> > /* alloc pool ID */
> > mutex_lock(&wq_pool_mutex);
> >