Re: [RFC 0/2] Memoryless nodes and kworker

Nish Aravamudan <nish.aravamudan@xxxxxxxxx> · Fri, 18 Jul 2014 11:47:08 -0700

On Fri, Jul 18, 2014 at 11:19 AM, Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:

> > why aren't these callers using kthread_create_on_cpu()? That API was
>
> It is using that.  There just are other data structures too.

Sorry, I might not have been clear.

Why are any callers of the format kthread_create_on_node(..., cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

In total in Linus' tree, there are only two APIs that use kthread_create_on_cpu() -- smpboot_create_threads() and smpboot_register_percpu_thread(). Neither of those seem to be used by the workqueue code that I can see as of yet.

> > already change to use cpu_to_mem() [so one change, rather than of all over
> > the kernel source]. We could change it back to cpu_to_node and push down
> > the knowledge about the fallback.

>
> And once it's properly solved, please convert back kthread to use
> cpu_to_node() too.  We really shouldn't be sprinkling the new subtly
> different variant across the kernel.  It's wrong and confusing.

I understand what you mean, but it's equally wrong for the kernel to be wasting GBs of slab. Different kinds of wrongness :)

> > Yes, this is a good point. But honestly, we're not really even to the point

> > of talking about fallback here, at least in my testing, going off-node at
> > all causes SLUB-configured slabs to deactivate, which then leads to an
> > explosion in the unreclaimable slab.

>
> I don't think moving the logic inside allocator proper is a huge
> amount of work and this isn't the first spillage of this subtlety out
> of allocator proper.  Fortunately, it hasn't spread too much yet.

> Let's please stop it here.  I'm not saying you shouldn't or can't fix
> the off-node allocation.

It seems like an additional reasonable approach would be to provide a suitable _cpu() API for the allocators. I'm not sure why saying that callers should know about NUMA (in order to call cpu_to_node() in every caller) is any better than saying that callers should know about memoryless nodes (in order to call cpu_to_mem() in every caller instead) -- when at least in several cases that I've seen the relevant data is what CPU we're expecting to run or are running on. Seems like the _cpu API would specify -- please allocate memory local to this CPU, wherever it is?

Thanks,
Nish