Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>>> when it is context switched. This can be disabled by architectures that
>>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>>
>>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>>
>>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>>> with the resulting code if task->active_mm were at least better
>>>>> documented and possibly even guarded by ifdefs.
>>>>
>>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>>> I don't think anything has changed in 20 years, I don't know what more
>>>> is needed, but if you can add to documentation that would be nice. Maybe
>>>> moving a bit of that into .c and .h files?
>>>>
>>>
>>> Quoting from that file:
>>>
>>> - however, we obviously need to keep track of which address space we
>>> "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>> which shows what the currently active address space is.
>>>
>>> This isn't even true right now on x86.
>>
>> From the perspective of core code, it is. x86 might do something crazy
>> with it, but it has to make it appear this way to non-arch code that
>> uses active_mm.
>>
>> Is x86's scheme documented?
>>
>>> With your patch applied:
>>>
>>> To support all that, the "struct mm_struct" now has two counters: a
>>> "mm_users" counter that is how many "real address space users" there are,
>>> and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>> users) plus one if there are any real users.
>>>
>>> isn't even true any more.
>>
>> Well yeah but the active_mm concept hasn't changed. The refcounting
>> change is hopefully reasonably documented?
>>
>>>
>>>
>>>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>>>> x86 bare metal *also* does not need ->active_mm. Under the x86 scheme,
>>>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>>>> dangling pointer, and this makes me extremely uncomfortable.
>>>>>
>>>>> So I tend to think that, depending on config, the core code should
>>>>> either keep ->active_mm [1] alive or get rid of it entirely.
>>>>
>>>> I don't actually know what you mean.
>>>>
>>>> core code needs the concept of an "active_mm". This is the mm that your
>>>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>>>> active_mm still points to init_mm for kernel threads.
>>>
>>> Core code does *not* need this concept. First, it's wrong on x86 since
>>> at least 4.15. Any core code that actually assumes that ->active_mm is
>>> "active" for any sensible definition of the word active is wrong.
>>> Fortunately there is no such code.
>>>
>>> I looked through all active_mm references in core code. We have:
>>>
>>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>>> with membarrier.
>>>
>>> kernel/kthread.c: same. refcounting and membarrier stuff.
>>>
>>> kernel/exit.c: exit_mm() a BUG_ON().
>>>
>>> kernel/fork.c: initialization code and a warning.
>>>
>>> kernel/cpu.c: cpu offline stuff. wouldn't be needed if active_mm went away.
>>>
>>> fs/exec.c: nothing of interest
>>
>> I might not have been clear.
>> Core code doesn't need active_mm if
>> active_mm somehow goes away. I'm saying active_mm can't go away because
>> it's needed to support (most) archs that do lazy tlb mm switching.
>>
>> The part I don't understand is when you say it can just go away. How?
>>
>>> I didn't go through drivers, but I maintain my point. active_mm is
>>> there for refcounting. So please don't just make it even more confusing
>>> -- do your performance improvement, but improve the code at the same
>>> time: get rid of active_mm, at least on architectures that opt out of
>>> the refcounting.
>>
>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>> Not even in theory.
>
> That is to say, it does do a type of reference management that requires
> active_mm so you can argue it has not entirely opted out of refcounting.
> But we're not just doing refcounting for the sake of refcounting! That
> would make no sense.
>
> active_mm is required because that's the mm that we have switched to
> (from core code's perspective), and it is integral to know when to
> switch to a different mm. See how active_mm is a fundamental concept
> in core code? It's part of the contract between core code and the
> arch mm context management calls. Reference counting follows from there
> but it's not the _reason_ for this code.
>
> Pretend the reference problem does not exist (whether by refcounting or
> shootdown or garbage collection or whatever). We still can't remove
> active_mm! We need it to know how to call into arch functions like
> switch_mm.
>
> I don't know if you just forgot that critical requirement in your above
> list, or you actually are entirely using x86's mental model for this
> code which is doing something entirely different that does not need it
> at all.
> If that is the case I really don't mind some cleanup or wrapper
> functions for x86 to entirely do its own thing, but if that's the case
> you can't criticize core code's use of active_mm due to the current
> state of x86. It's x86 that needs documentation and cleaning up.

Ah, that must be where your confusion is coming from: x86's switch_mm
doesn't use prev anywhere, and the reference scheme it is using appears
to be under-documented, although vague references in changelogs suggest
it has not actually "opted out" of active_mm refcounting.

That's understandable, but please redirect your objections to the proper
place. git blame suggests 3d28ebceaffab.

Thanks,
Nick