On Sun, Jan 9, 2022 at 6:40 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> Also, while 800 loads is kinda expensive, it is a heck of
> a lot less expensive than 800 IPIs.

Rik, the IPI's you have to do *anyway*.

So there are exactly zero extra IPI's.

Go take a look. It's part of the whole "flush TLB's" thing in __mmput().

So let me explain one more time what I think we should have done, at least on x86:

 (1) stop refcounting active_mm entries entirely on x86

Why can we do that? Because instead of worrying about doing those mm_count games for the active_mm reference, we realize that any active_mm has to have a _regular_ mm associated with it, and that mm has a 'mm_users' count.

And when that mm_users count goes to zero, we have:

 (2) mmput -> __mmput -> exit_mmap(), which already has to flush all TLB's because it's tearing down the page tables

And since it has to flush those TLB's as part of tearing down the page tables, on x86 we then have:

 (3) that TLB flush will have to do the IPI's to anybody who has that mm active already

and that IPI has to be done *regardless*.

And the TLB flushing done by that IPI? That code already clears the lazy status (and not doing so would be pointless and in fact wrong).

Notice? There isn't some "800 loads". There isn't some "800 IPI's". And there isn't any refcounting cost of the lazy TLB.

Well, right now there *is* that refcounting cost, but my point is that I don't think it should exist. It shouldn't exist as an atomic access to mm_count (with those cache ping-pongs when you have a lot of threads across a lot of CPUs), but it *also* shouldn't exist as a "lightweight hazard pointer".

See my point?

I think the lazy-tlb refcounting we do is pointless if you have to do IPI's for TLB flushes.

Note: the above is for x86, which has to do the IPI's anyway (and it's very possible that if you don't have to do IPI's because you have HW TLB coherency, maybe lazy TLB's aren't what you should be using, but I think that should be a separate discussion).
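[Editor's note: the argument in steps (1)-(3) can be sketched as a userspace toy model. This is not kernel code; `struct toy_mm`, `toy_mmput()` and `flush_tlb_ipi()` are invented stand-ins for `mm_struct`, `mmput()`/`exit_mmap()` and the x86 TLB shootdown. The point it illustrates is only the one made above: the flush that teardown must do anyway also clears every CPU's lazy use of the mm, so no per-CPU lazy refcount is needed.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy model: an mm with real users, plus CPUs that may hold it
 * as a lazy active_mm *without* taking any reference on it. */
struct toy_mm {
	atomic_int mm_users;          /* users of the page tables */
	bool lazy_on_cpu[NR_CPUS];    /* CPUs running lazily on this mm */
};

/* Step (3): the teardown TLB flush "IPI" visits every CPU and, as a
 * side effect, clears that CPU's lazy use of the dying mm. */
static void flush_tlb_ipi(struct toy_mm *mm)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		mm->lazy_on_cpu[cpu] = false;  /* kicked off the mm */
}

/* Step (2): the last mmput tears down the page tables, which already
 * has to flush TLB's -- and that flush evicts all lazy users for free. */
static void toy_mmput(struct toy_mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_users, 1) == 1)
		flush_tlb_ipi(mm);  /* exit_mmap()'s flush, modeled */
}
```

After the last user drops out, no CPU is left lazily on the mm, even though the lazy users were never counted anywhere.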
And yes, right now we do that pointless reference counting, because it was simple and straightforward, and people historically didn't see it as a problem.

Plus we had to have that whole secondary 'mm_count' anyway for other reasons, since we use it for things that need to keep a ref to 'struct mm_struct' around regardless of page table counts (eg things like a lot of /proc references to 'struct mm_struct' do not want to keep forced references to user page tables alive).

But I think conceptually mm_count (ie mmgrab/mmdrop) was always really dodgy for lazy TLB. Lazy TLB really cares about the page tables still being there, and that's not what mm_count is ostensibly about. That's really what mm_users is about.

Yet mmgrab/mmdrop is exactly what the lazy TLB code uses, even if it's technically odd (ie mmgrab really only keeps the 'struct mm' around, but not the vma's and page tables).

Side note: you can see the effects of this misuse of mmgrab/mmdrop in how we tear down _almost_ all the page table state in __mmput(). But look at what we keep until the final __mmdrop, even though there are no page tables left:

        mm_free_pgd(mm);
        destroy_context(mm);

exactly because even though we've torn down all the page tables earlier, we had to keep the page table *root* around for the lazy case.

It's kind of a layering violation, but it comes from that lazy-tlb mm_count use, and so we have that odd situation where the page table directory lifetime is very different from the rest of the page table lifetimes.

(You can easily make excuses for it by just saying that "mm_users" is the user-space page table user count, and that the page directory has a different lifetime because it's also about the kernel page tables, so it's all a bit of a gray area, but I do think it's also a bit of a sign of how our refcounting for lazy-tlb is a bit dodgy).

              Linus
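[Editor's note: the split lifetime described in the side note can be sketched as another userspace toy model. Again not kernel code; `toy_mmput()`/`toy_mmdrop()` are invented stand-ins for `__mmput()`/`__mmdrop()`, and the `page_tables`/`pgd` pointers are placeholders for the real structures. It shows mm_users freeing the page tables while mm_count keeps the page table *root* alive for any remaining lazy-TLB grab, exactly the layering oddity the mail points at.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy model of the two-counter lifetime: mm_users guards the page
 * tables, mm_count guards the struct itself -- except the root (pgd)
 * rides along with mm_count, because a lazy user may still load it. */
struct toy_mm {
	atomic_int mm_users;
	atomic_int mm_count;
	void *page_tables;  /* freed when mm_users hits zero   */
	void *pgd;          /* freed only when mm_count does   */
};

static void toy_mmdrop(struct toy_mm *mm);

static void toy_mmput(struct toy_mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_users, 1) == 1) {
		free(mm->page_tables);  /* exit_mmap(): tables go now... */
		mm->page_tables = NULL;
		toy_mmdrop(mm);         /* drop mm_users' pin on mm_count */
	}
}

static void toy_mmdrop(struct toy_mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_count, 1) == 1) {
		free(mm->pgd);  /* mm_free_pgd(): ...but the root had to
				   wait for the last (lazy) mm_count ref */
		mm->pgd = NULL;
		free(mm);
	}
}
```

With one lazy-TLB grab outstanding (mm_count held at 2), the last `toy_mmput()` frees the page tables but leaves the pgd alive until the matching `toy_mmdrop()`, mirroring the different lifetimes described above.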