On Tue, 1 Oct 2024 at 18:04, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> Hazard pointers appear to be a good fit for replacing refcount based lazy
> active mm tracking.

If the mm refcount is this expensive, I suspect we really shouldn't use it at all.

The thing is, we don't _need_ to use the mm refcount - the reason the lazy-tlb handling uses it is because we already had that refcount and it was easy to extend on existing logic, not because it's really required any more.

The lazy-tlb activation is basically "I'm switching to a kernel thread, so I'll re-use the TLB state of the previous thread". (And yes, it also has a secondary case of "I'm exiting, so I will turn the mm I already have into a lazy one").

But in the actual task switch case, the previous thread hasn't _lost_ that mm, so we don't actually need to take the refcount at all. We really just need to make sure to invalidate it before it's torn down, but we do that *anyway* as part of TLB flushing. (The exit case is actually different: we are setting it up to be lost, although delayed - and the lazy count is the delay).

The only thing the refcount means is that we don't actually have to be as careful when we actually *really* get rid of the MM. We can be a bit laissez-faire about things because even if we weren't to invalidate the lazy mm, it does have its own refcount, so we don't much care.

But in reality, we're actually very careful about the active_mm _anyway_, because of a fairly fundamental issue: the TLB shootdown and PCID handling that we need to do even when mm's aren't lazy. So we actually keep track of things like "which CPU's have seen this MM state" in all the TLB code.

And even the exit case doesn't actually need the special thing - it *does* need the "this CPU is still using this MM", but we have that too as part of the TLB code - entirely independently of 'active_mm'.
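
As a rough illustration of the argument so far, here is a toy user-space model in C. It is not kernel code. Every name in it (toy_switch_to, toy_teardown, the simplified struct mm with a plain bitmask) is a hypothetical stand-in; only 'loaded_mm' echoes a real x86 name. The sketch assumes that a kernel thread simply keeps whatever mm is already loaded, without taking a reference, and that teardown invalidates every CPU still recorded as using the mm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical, heavily simplified mm: no refcount at all. */
struct mm {
    unsigned long cpumask;  /* bit n set: CPU n may still have this mm loaded */
    bool torn_down;
};

struct task {
    struct mm *mm;          /* NULL means "kernel thread" */
};

/* Per-CPU record of what is really loaded (toy analogue of 'loaded_mm'). */
static struct mm *loaded_mm[NR_CPUS];

/* Switch 'cpu' to 'next'. A kernel thread borrows the loaded mm lazily. */
static void toy_switch_to(int cpu, const struct task *next)
{
    assert(cpu >= 0 && cpu < NR_CPUS);

    struct mm *prev = loaded_mm[cpu];

    if (!next->mm || next->mm == prev)
        return;             /* lazy case: keep prev loaded, take no reference */

    if (prev)
        prev->cpumask &= ~(1UL << cpu);
    next->mm->cpumask |= 1UL << cpu;
    loaded_mm[cpu] = next->mm;
}

/*
 * Tear down 'mm'. The per-CPU tracking says who still uses it, so every
 * such CPU is invalidated first. Returns the number of CPUs shot down.
 */
static int toy_teardown(struct mm *mm)
{
    int flushed = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (mm->cpumask & (1UL << cpu)) {
            loaded_mm[cpu] = NULL;  /* stand-in for switching to a safe mm */
            mm->cpumask &= ~(1UL << cpu);
            flushed++;
        }
    }
    mm->torn_down = true;
    return flushed;
}
```

In this model the lazy user never holds a reference: correctness rests entirely on teardown consulting the per-CPU tracking, which is the property the text above says the TLB code already provides.
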
So in many ways, I'm pretty sure not just the refcount, but all of 'active_mm', is largely pointless to begin with.

And if the refcount really is this big of a deal:

> nr threads (-t)     speedup
>     192              +28%

then we should probably just strive to get rid of 'active_mm' altogether.

Look, at least on x86 we ALREADY have a better replacement: it's the percpu 'cpu_tlbstate'. It basically duplicates all we do with active_mm and the whole "keep track of old mm state" (the 'loaded_mm' member is basically the true 'active' mm), except it has some additional fixes:

 - it has some extra housekeeping data that the architecture wants (for PCID updates etc)

 - it's actually atomic wrt the low-level code in ways that 'current->active_mm' isn't

So I think the real issue is that "active_mm" is an old hack from a bygone era when we didn't have the (much more involved) full TLB tracking.

             Linus