Re: [patch V4 4/8] sched: Make migrate_disable/enable() independent of RT

Andy Lutomirski <luto@xxxxxxxxxx> · Sun, 22 Nov 2020 15:16:12 -0800

On Fri, Nov 20, 2020 at 1:29 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>

> As already stated, per-cpu page-tables would allow for a much saner kmap
> approach, but alas, x86 really can't sanely do that (the archs that have
> separate kernel and user page-tables could do this, and how we cursed
> x86 didn't have that when meltdown happened).
>
> [ and using fixmaps in the per-cpu memory space _could_ work, but is a
>   giant pain because then all accesses need GS prefix and blah... ]
>
> And I'm sure there's creative ways for other problems too, but yes, it's
> hard.
>
> Anyway, clearly I'm the only one that cares, so I'll just crawl back
> under my rock...

I'll poke my head out of the rock for a moment, though...

Several years ago, we discussed (in person at some conference IIRC)
having percpu pagetables to get sane kmaps, percpu memory, etc.  The
conclusion was that Linus thought the performance would suck and we
shouldn't do it.  Since then, though, we added really fancy
infrastructure for keeping track of a per-CPU list of recently used
mms and efficiently tracking when they need to be invalidated.  We
called these "ASIDs".  It would be fairly straightforward to have an
entire pgd for each (cpu, asid) pair.  Newly added second-level
(p4d/pud/whatever -- have I ever mentioned how much I dislike the
Linux pagetable naming conventions and folding tricks?) tables could
be lazily faulted in, and copies of the full 2kB mess would only be
needed when a new (cpu,asid) is allocated because either a flush
happened while the mm was inactive on the CPU in question or because
the mm fell off the percpu cache.
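
To make the scheme above a bit more concrete, here is a rough userspace
toy model, not kernel code.  Everything in it is made up for
illustration: struct mm, struct asid_slot, switch_pgd() and lazy_fill()
are hypothetical names, NR_CPUS is arbitrary, and NR_ASIDS = 6 is just
my assumption about how many slots each CPU tracks.  It only models the
two events described above: the 2kB copy when a (cpu, asid) slot is
(re)allocated, and the lazy fill of a top-level entry that was added
afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS          4    /* hypothetical machine size */
#define NR_ASIDS         6    /* assumed number of slots per CPU */
#define PGD_ENTRIES      512
#define USER_PGD_ENTRIES 256  /* 256 * 8 bytes = the 2kB user half */

struct mm {                    /* hypothetical stand-in for an mm */
	uint64_t id;               /* nonzero; 0 marks an unused slot */
	uint64_t tlb_gen;          /* bumped whenever the mm is flushed */
	uint64_t pgd[PGD_ENTRIES]; /* master top-level table */
};

struct asid_slot {
	uint64_t mm_id;
	uint64_t tlb_gen;
	uint64_t pgd[PGD_ENTRIES]; /* private to this (cpu, asid) pair */
};

static struct asid_slot slots[NR_CPUS][NR_ASIDS];
static unsigned next_victim[NR_CPUS];

/*
 * Return the pgd to load on 'cpu' for 'mm'.  *copied is set to 1 when
 * the 2kB user half had to be copied (slot was stale or newly
 * allocated) and to 0 on the cheap path.  The kernel half, where the
 * per-CPU mappings would live, is left alone in this model.
 */
uint64_t *switch_pgd(unsigned cpu, const struct mm *mm, int *copied)
{
	struct asid_slot *s = NULL;

	assert(cpu < NR_CPUS && mm->id != 0);

	for (unsigned a = 0; a < NR_ASIDS; a++) {
		if (slots[cpu][a].mm_id == mm->id) {
			s = &slots[cpu][a];
			break;
		}
	}

	if (s && s->tlb_gen == mm->tlb_gen) {
		*copied = 0;           /* common case: nothing to do */
		return s->pgd;
	}

	if (!s) {                  /* mm fell off this CPU's cache */
		s = &slots[cpu][next_victim[cpu]];
		next_victim[cpu] = (next_victim[cpu] + 1) % NR_ASIDS;
	}

	memcpy(s->pgd, mm->pgd, USER_PGD_ENTRIES * sizeof(uint64_t));
	s->mm_id = mm->id;
	s->tlb_gen = mm->tlb_gen;
	*copied = 1;
	return s->pgd;
}

/*
 * Fault path: if the master table has a top-level entry that this
 * (cpu, asid) copy lacks, pull it in and return 1.  Return 0 if the
 * fault is genuine and has to go through the normal handler.
 */
int lazy_fill(uint64_t *cpu_pgd, const struct mm *mm, unsigned idx)
{
	assert(idx < USER_PGD_ENTRIES);

	if (cpu_pgd[idx] == 0 && mm->pgd[idx] != 0) {
		cpu_pgd[idx] = mm->pgd[idx];
		return 1;
	}
	return 0;
}
```

Switching back to an mm whose slot is still current takes the
copied == 0 path, which is the case I care about most.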

The total overhead would be a bit more cache usage, 4kB * num cpus *
num ASIDs per CPU (or 8k for PTI), and a few extra page faults (max
num cpus * 256 per mm over the entire lifetime of that mm).  The
common case of a CPU switching back and forth between a small number
of mms would have no significant overhead.
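
For a feel of the numbers, the arithmetic above can be written down as
a couple of helpers.  This is only a back-of-the-envelope sketch: the
function names are made up, and the CPU and ASID counts you feed in are
whatever you assume for your machine (I believe x86 currently tracks 6
dynamic ASIDs per CPU, but treat that as an assumption).

```c
#include <assert.h>
#include <stdint.h>

#define PGD_BYTES        4096u /* one top-level table */
#define USER_PGD_ENTRIES 256u  /* user half of the top-level table */

/* Memory for one pgd per (cpu, asid) pair; PTI needs two per pair. */
uint64_t percpu_pgd_bytes(unsigned ncpus, unsigned asids_per_cpu, int pti)
{
	assert(ncpus > 0 && asids_per_cpu > 0);
	return (uint64_t)ncpus * asids_per_cpu * PGD_BYTES * (pti ? 2 : 1);
}

/* Upper bound on extra lazy-fill faults per mm over its lifetime. */
uint64_t max_lazy_faults(unsigned ncpus)
{
	return (uint64_t)ncpus * USER_PGD_ENTRIES;
}
```

With 64 CPUs and 6 ASIDs per CPU, percpu_pgd_bytes(64, 6, 0) comes to
1572864 bytes (1.5 MiB), twice that with PTI, and max_lazy_faults(64)
is 16384.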

On an unrelated note, what happens if you migrate_disable(), sleep for
a looooong time, and someone tries to offline your CPU?