Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.

Valentin Schneider <valentin.schneider@xxxxxxx> · Mon, 19 Oct 2020 16:51:06 +0100

On 19/10/20 15:27, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 14:48:02 +0100
> Valentin Schneider <valentin.schneider@xxxxxxx> wrote:
>>
>> That's my cue to paste some of that stuff I've been rambling on and off
>> about!
>>
>> With regard to cache / interconnect layout, I do believe that if we
>> want to support that in the scheduler itself then we should leverage some
>> distance table rather than create X extra scheduler topology levels.
>>
>> I had a chat with Jeremy on the ACPI side of that some time ago. IIRC given
>> that SLIT gives us a distance value between any two PXMs, we could directly
>> express core-to-core distance in that table. With that (and if that still
>> lets us properly discover NUMA node spans), we could let the scheduler
>> build dynamic NUMA-like topology levels representing the inner quirks of
>> the cache / interconnect layout.
>
> You would rapidly run into the problem SLIT had for NUMA node description.
> There is no consistent description of distance, and except in the vaguest
> sense of 'nearer' it wasn't any use for anything. That is why HMAT
> came along. It's far from perfect but it is a step up.
>

I wasn't aware of HMAT; my feeble ACPI knowledge is limited to SRAT / SLIT
/ PPTT, so thanks for pointing this out.

> I can't see how you'd generalize those particular tables to do anything
> for intercore comms without breaking their use for NUMA, but something
> a bit similar might work.
>

Right, there's the issue of still being able to determine NUMA node
boundaries.

> A lot of thought (and meeting time) has gone into trying to improve the
> situation for complex topology around NUMA. Whilst there are differences
> in representing the internal interconnects and caches, it seems like a
> somewhat similar problem. The issue there is that it is really hard to
> describe this stuff with enough detail to be useful, but simple enough to
> be usable.
>
> https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@xxxxxxxxxx/
>

Thanks for the link!

>>
>> It's mostly pipe dreams for now, but there seems to be more and more
>> hardware where that would make sense; somewhat recently the PowerPC guys
>> added something to their arch-specific code in that regard.
>
> Pipe dream == something to work on ;)
>
> ACPI has a nice code-first model of updating the spec now, so we can discuss
> this one in public, and propose spec changes only once we have a proven
> implementation.
>

FWIW I blabbered about a "generalization" of NUMA domains & distances
within the scheduler at LPC19 (and have been pasting that occasionally,
apologies for the broken record):

https://linuxplumbersconf.org/event/4/contributions/484/

I've only pondered the implementation, but if (big if; also I really
despise advertising "the one solution that will solve all your issues"
which this is starting to sound like) it would help I could cobble together
an RFC leveraging a separate distance table.

It doesn't solve the "funneling cache properties into a single number"
issue, which as you just pointed out in a parallel email is a separate
discussion altogether.
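
To make the separate distance table idea a bit more concrete, here is a toy
model (Python, not kernel code) of how topology levels could be derived from
a CPU-to-CPU distance table. It is loosely modelled on what sched_init_numa()
does with node distances: one level per unique distance, with each CPU
spanning everything within that distance. The table layout, the distance
values and the build_levels() name are all made up for illustration; nothing
here reflects an existing ACPI table or scheduler interface.

```python
def build_levels(dist):
    """Derive topology levels from a CPU-to-CPU distance table.

    dist is a symmetric NxN matrix (hypothetical input). Returns a list
    of (distance, spans) pairs sorted by increasing distance, where
    spans[i] is the set of CPUs within that distance of CPU i.
    """
    n = len(dist)
    unique = sorted({dist[i][j] for i in range(n) for j in range(n)})
    levels = []
    for d in unique:
        spans = [
            frozenset(j for j in range(n) if dist[i][j] <= d)
            for i in range(n)
        ]
        levels.append((d, spans))
    return levels


# Made-up values: 10 = self, 15 = same cluster, 20 = rest of the die.
EXAMPLE_DIST = [
    [10, 15, 20, 20],
    [15, 10, 20, 20],
    [20, 20, 10, 15],
    [20, 20, 15, 10],
]

if __name__ == "__main__":
    for d, spans in build_levels(EXAMPLE_DIST):
        print(d, [sorted(s) for s in spans])
```

With this table, the middle level yields the {0,1} and {2,3} clusters
without a dedicated topology level having been declared for them.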

> Note I'm not proposing we put the cluster stuff in the scheduler, just
> provide it as a hint to userspace.
>

The goal being to tweak tasks' affinities, right? Other than CPU pinning
and rare cases, IMO if userspace has to mess around with affinities it
is due to the failings of the underlying scheduler. Restricted CPU
affinities are also something the load-balancer struggles with; I have been
fighting such issues, where just a single per-CPU kworker waking up at the
wrong time can mess up load-balance for quite some time. I tend to phrase
it as: "if you're rude to the scheduler, it can and will respond in kind".

Now yes, it's not the same timescale nor amount of work, but this is
something the scheduler itself should leverage, not userspace.

> Jonathan