Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

Hi Andy, Linus,

On Sun, Jan 09, 2022 at 12:48:42PM -0800, Linus Torvalds wrote:
> On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >
> > Are you *sure*? The ASID management code on x86 is (as mentioned
> > before) completely unaware of whether an ASID is actually in use
> > anywhere.
> 
> Right.
> 
> But the ASID situation on x86 is very very different, exactly because
> x86 doesn't have cross-CPU TLB invalidates.
> 
> Put another way: x86 TLB hardware is fundamentally per-cpu. As such,
> any ASID management is also per-cpu.
> 
> That's fundamentally not true on arm64.  And that's not some "arm64
> implementation detail". That's fundamental to doing cross-CPU TLB
> invalidates in hardware.
> 
> If your TLB invalidates act across CPU's, then the state they act on
> are also obviously across CPU's.
> 
> So the ASID situation is fundamentally different depending on the
> hardware usage. On x86, TLB's are per-core, and on arm64 they are not,
> and that's reflected in our code too.
> 
> As a result, on x86, each mm has a per-cpu ASID, and there's a small
> array of per-cpu "mm->asid" mappings.
> 
> On arm, each mm has an asid, and it's allocated from a global asid
> space - so there is no need for that "mm->asid" mapping, because the
> asid is there in the mm, and it's shared across cpus.
> 
> That said, I still don't actually know the arm64 ASID management code.

That appears to be a common theme in this thread, so hopefully I can shed
some light on the arm64 side of things:

The CPU supports either 8-bit or 16-bit ASIDs and we require that we don't
have more CPUs than we can represent in the ASID space (well, we WARN in
this case but it's likely to go badly wrong). We reserve ASID 0 for things
like the idmap, so as far as the allocator is concerned ASID 0 is "invalid"
and we rely on this.
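
To make that concrete, the allocator state boils down to something like the
following. This is a sketch with made-up names rather than the actual code
(which lives in arch/arm64/mm/context.c), but it gives the shape of it:

/* Sketch only -- names and details are illustrative, not the exact code. */
#define ASID_BITS		16	/* or 8, depending on the CPU */
#define NUM_USER_ASIDS		(1UL << ASID_BITS)

static unsigned long *asid_map;		/* bitmap of in-use ASIDs */

static int asids_init(void)
{
	/*
	 * If the ASID space can't cover every CPU, rollover can't hand
	 * each CPU a distinct ASID, so all we can do is warn.
	 */
	WARN_ON(NUM_USER_ASIDS - 1 <= num_possible_cpus());

	asid_map = bitmap_zalloc(NUM_USER_ASIDS, GFP_KERNEL);
	if (!asid_map)
		return -ENOMEM;

	__set_bit(0, asid_map);		/* ASID 0 is reserved ("invalid") */
	return 0;
}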

As Linus says above, the ASID is per-'mm' and we require that all threads
of an 'mm' use the same ASID at the same time, otherwise the hardware TLB
broadcasting isn't going to work properly because the invalidations are
typically tagged by ASID.

As Andy points out later, this means that we have to reuse ASIDs for
different 'mm's once we have enough of them. We do this using a 64-bit
context ID in mm_context_t, where the lower bits are the ASID for the 'mm'
and the upper bits are a generation count. The ASID allocator keeps an
internal generation count which is incremented whenever we fail to allocate
an ASID and are forced to invalidate them all and start re-allocating. We
assume that the generation count doesn't overflow.
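
In code form, the split is roughly the following (same caveats as above --
illustrative names, building on the sketch earlier):

#define ASID_MASK		(NUM_USER_ASIDS - 1)
#define ASID_FIRST_VERSION	NUM_USER_ASIDS

/* The allocator's current generation; bumped on every rollover. */
static atomic64_t asid_generation = ATOMIC64_INIT(ASID_FIRST_VERSION);

/* mm->context.id: generation in the upper bits, ASID in the lower bits. */
static inline u64 ctxid2asid(u64 ctxid) { return ctxid & ASID_MASK; }
static inline u64 ctxid2gen(u64 ctxid)  { return ctxid & ~ASID_MASK; }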

When switching to an 'mm', we check if the generation count held in the
'mm' is behind the allocator's current generation count. If it is, then
we know that the 'mm' needs to be allocated a new ASID. Allocation is
performed with a spinlock held and basically involves setting a new bit
in the bitmap and updating the 'mm' with the new ASID and current
generation. We don't reclaim ASIDs greedily on 'mm' teardown -- this was
pretty slow when I looked at it in the past.
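
The switch-time check then has roughly this shape -- heavily simplified
pseudo-kernel-C, with the lockless fast path and the rollover handling
elided (rollover is what the next paragraphs are about):

static DEFINE_RAW_SPINLOCK(cpu_asid_lock);

static void check_and_switch_context(struct mm_struct *mm)
{
	u64 ctxid = atomic64_read(&mm->context.id);

	if (ctxid2gen(ctxid) != atomic64_read(&asid_generation)) {
		raw_spin_lock(&cpu_asid_lock);

		/* Re-check under the lock before allocating. */
		ctxid = atomic64_read(&mm->context.id);
		if (ctxid2gen(ctxid) != atomic64_read(&asid_generation)) {
			u64 asid = find_next_zero_bit(asid_map,
						      NUM_USER_ASIDS, 1);

			/*
			 * If the bitmap is full we bump the generation and
			 * go through rollover -- elided here, see below.
			 */
			__set_bit(asid, asid_map);
			ctxid = asid | atomic64_read(&asid_generation);
			atomic64_set(&mm->context.id, ctxid);
		}

		raw_spin_unlock(&cpu_asid_lock);
	}

	/*
	 * Update this CPU's 'active_asids' and write TTBR0 with the new
	 * ASID here (the rollover interaction is described below).
	 */
}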

So far so good, but it gets more complicated when we look at the details of
the overflow handling. Overflow is always detected on the allocation path
with the spinlock held but other CPUs could happily be running other code
(inc. user code) at this point. Therefore, we can't simply invalidate the
TLBs, clear the bitmap and start re-allocating ASIDs because we could end up
with an ASID shared between two running 'mm's, leading to both invalidation
interference and the potential to hit stale TLB entries allocated after
the invalidation on rollover. We handle this with a couple of per-cpu
variables, 'active_asids' and 'reserved_asids'.
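
Roughly speaking (again, illustrative rather than the exact definitions):

static DEFINE_PER_CPU(atomic64_t, active_asids);  /* ASID in TTBR0 right now */
static DEFINE_PER_CPU(u64, reserved_asids);       /* ASID carried over rollover */
static cpumask_t tlb_flush_pending;               /* CPUs with a stale local TLB */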

'active_asids' is set to the current ASID in switch_mm() just before
writing the actual TTBR register. On a rollover, the CPU holding the lock
goes through each CPU's 'active_asids' entry, atomic xchg()s it to 0 and
writes the result into the corresponding 'reserved_asids' entry. These
'reserved_asids' are then immediately marked as allocated and a flag is
set for each CPU to indicate that its TLBs are dirty. This allows the
CPU handling the rollover to continue with its allocation without stopping
the world and without broadcasting TLB invalidation; other CPUs will
hit a generation mismatch on their next switch_mm(), notice that they are
running a reserved ASID from an older generation, upgrade the generation
(i.e. keep the same ASID) and then invalidate their local TLB.
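
Sketched in the same style as before, the rollover side looks something like
this (the generation bump itself is done by the caller and omitted here):

/* Called with cpu_asid_lock held when we run out of free ASIDs. */
static void flush_context(void)
{
	int cpu;
	u64 ctxid;

	/* New generation starts with an empty bitmap; ASID 0 stays reserved. */
	bitmap_zero(asid_map, NUM_USER_ASIDS);
	__set_bit(0, asid_map);

	for_each_possible_cpu(cpu) {
		/*
		 * Steal whatever this CPU is currently running with. Zero
		 * means it already took the rollover path and its
		 * 'reserved_asids' entry is still valid.
		 */
		ctxid = atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), 0);
		if (ctxid == 0)
			ctxid = per_cpu(reserved_asids, cpu);

		__set_bit(ctxid2asid(ctxid), asid_map);
		per_cpu(reserved_asids, cpu) = ctxid;
	}

	/* Every CPU's local TLB is now suspect; flushed lazily in switch_mm(). */
	cpumask_setall(&tlb_flush_pending);
}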

So we do have some tracking of which ASIDs are where, but we can't generally
say "is this ASID dirty in the TLBs of this CPU". That also gets more
complicated on some systems where a TLB can be shared between some of the
CPUs (I haven't covered that case above, since I think that this is enough
detail already.)

FWIW, we have a TLA+ model of some of this, which may (or may not) be easier
to follow than my text:

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/asidalloc.tla

although the syntax is pretty hard going :(

> The thing about TLB flushes is that it's ok to do them spuriously (as
> long as you don't do _too_ many of them and tank performance), so two
> different mm's can have the same hw ASID on two different cores and
> that just makes cross-CPU TLB invalidates too aggressive. You can't
> share an ASID on the _same_ core without flushing in between context
> switches, because then the TLB on that core might be re-used for a
> different mm. So the flushing rules aren't necessarily 100% 1:1 with
> the "in use" rules, and who knows if the arm64 ASID management
> actually ends up just matching what that whole "this lazy TLB is still
> in use on another CPU".

The shared TLBs (Arm calls this "Common-not-private") make this problematic,
as the TLB is no longer necessarily per-core.

> So I don't really know the arm64 situation. And it's possible that lazy
> TLB isn't even worth it on arm64 in the first place.

ASID allocation aside, I think there are a few useful things to point out
for arm64:

	- We only have "local" or "all" TLB invalidation; nothing targeted
	  (and for KVM guests this is always "all").

	- Most mms end up running on more than one CPU (at least, when I
	  last looked at this, a fork+exec would end up with the mm having
	  been installed on two CPUs)

	- We don't track mm_cpumask as it showed up as a bottleneck in the
	  past and, because of the earlier points, it wasn't very useful
	  anyway

	- mmgrab() should be fast for us (it's a posted atomic add), although
	  mmdrop() will be slower as the atomic has to return the old count
	  so we can check whether it has dropped to zero (rough sketch below).
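
For reference, those two boil down to roughly the following in the core
kernel (paraphrased from memory of include/linux/sched/mm.h; the point is
just that the grab side can ignore the result of the atomic, whereas the
drop side cannot):

static inline void mmgrab(struct mm_struct *mm)
{
	/* No return value needed, so the CPU can post the add and move on. */
	atomic_inc(&mm->mm_count);
}

static inline void mmdrop(struct mm_struct *mm)
{
	/* Needs the old count back to know whether we just hit zero. */
	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
		__mmdrop(mm);
}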

So it doesn't feel like an obvious win to me for us to scan these new hazard
pointers on arm64. At least, I would love to see some numbers if we're going
to make changes here.

Will


