Hi Linus: Thanks for letting us know that. We would fix this ASAP. The kvmgt.c module is a part of GVT-g code. It's our fault that we didn't find this mis-uses, not i915 or KVM guys. Wish they would feel better after seeing this message. Thanks, Zhi. On 08/23/18 00:44, Linus Torvalds wrote: > Guys and gals, > this is a *very* random list of people on the recipients list, but we > had a subtle TLB shootdown issue in the VM, and that brought up some > issues when people then went through the code more carefully. > > I think we have a handle on the TLB shootdown bug itself. But when > people were discussing all the possible situations, one thing that > came up was "use_mm()" that takes a mm, and makes it temporarily the > mm for a kernel thread (until "unuse_mm()", duh). > > And it turns out that some of those uses are definitely wrong, some of > them are right, and some of them are suspect or at least so overly > complicated that it's hard for the VM people to know if they are ok. > > Basically, the rule for "use_mm()" is that the mm in question *has* to > have a valid page table associated with it over the whole use_mm() -> > unuse_mm() sequence. That may sound obvious, and I guess it actually > is so obvious that there isn't even a comment about it, but the actual > users are showing that it's sadly apparently not so obvious after all. > > There is one user that uses the "obviously correct" model: the vhost > driver does a "mmget()" and "mmput()" pair around its use of it, > thanks to vhost_dev_set_owner() doing a > > dev->mm = get_task_mm(current); > > to look up the mm, and then the teardown case does a > > if (dev->mm) > mmput(dev->mm); > dev->mm = NULL; > > This is the *right* sequence. A gold star to the vhost people. > > Sadly, the vhost people are the only ones who seem to get things > unquestionably right. And some of those gold star people are also > apparently involved in the cases that didn't get things right. > > An example of something that *isn't* right, is the i915 kvm interface, > which does > > use_mm(kvm->mm); > > on an mm that was initialized in virt/kvm/kvm_main.c using > > mmgrab(current->mm); > kvm->mm = current->mm; > > which is *not* right. Using "mmgrab()" does indeed guarantee the > lifetime of the 'struct mm_struct' itself, but it does *not* guarantee > the lifetime of the page tables. You need to use "mmget()" and > "mmput()", which get the reference to the actual process address > space! > > Now, it is *possible* that the kvm use is correct too, because kvm > does register a mmu_notifier chain, and in theory you can avoid the > proper refcounting by just making sure the mmu "release" notifier > kills any existing uses, but I don't really see kvm doing that. Kvm > does register a release notifier, but that just flushes the shadow > page tables, it doesn't kill any use_mm() use by some i915 use case. > > So while the vhost use looks right, the kvm/i915 use looks definitely wrong. > > The other users of "use_mm()" and "unuse_mm()" are less > black-and-white right vs wrong.. > > One of the complex ones is the amdgpu driver. It does a > "use_mm(mmptr)" deep deep in the guts of a macro that ends up being > used in fa few places, and it's very hard to tell if it's right. > > It looks almost certainly buggy (there is no mmget/mmput to get the > refcount), but there _is_ a "release" mmu_notifier function and that > one - unlike the kvm case - looks like it might actually be trying to > flush the existing pending users of that mm. > > But on the whole, I'm suspicious of the amdgpu use. It smells. Jann > Horn pointed out that even if it migth be ok due to the mmu_notifier, > the comments are garbage: > >> Where "process" in the uniquely-named "struct queue" is a "struct >> kfd_process"; that struct's definition has this comment in it: >> >> /* >> * Opaque pointer to mm_struct. We don't hold a reference to >> * it so it should never be dereferenced from here. This is >> * only used for looking up processes by their mm. >> */ >> void *mm; >> >> So I think either that comment is wrong, or their code is wrong? > > so I'm chalking the amdgpu use up in the "broken" column. > > It's also actually quite hard to synchronze with some other kernel > worker thread correctly, so just on general principles, if you use > "use_mm()" it really really should be on something that you've > properly gotten a mm refcount on with mmget(). Really. Even if you try > to synchronize things. > > The two final cases are two uses in the USB gadget driver. Again, they > don't have the proper mmget/mmput, but they may br ok simply because > the uses are done for AIO, and the VM teardown is preceded by an AIO > teardown, so the proper serialization may come in from that. > > Anyway, sorry for the long email, and the big list of participants and > odd mailing lists, but I'd really like people to look at their > "use_mm()" cases, and ask themselves if they have done enough to > guarantee that the full mm exists. Again, "mmgrab()" is *not* enough > on its own. You need either "mmget()" or some lifetime guarantee. > > And if you do have those lifetime guarantees, it would be really nice > to get a good explanatory comment about said lifetime guarantees above > the "use_mm()" call. Ok? > > Note that the lifetime rules are very important, because obviously > use_mm() itself is never called in the context of the process that has > the mm. By definition, we're in a kernel thread and it uses somebody > elses mm. So it's important to show that that "somebody else" really > is serialized with the kernel thread somehow, and keeps the mm alive. > The easiest way by far to have that guarantee is to have that > "mmget/mmput" model. > > Please? Even if you're convinced your code is right, just a comment > about _why_ it's right, ok? > > And no, use_mm() cannot just do the mmget internally, only to have > unuse_mm() do the mmput(). That was our first reaction when looking > at this, but if the caller has screwed up the lifetime rules, it's not > actually clear that the mm is guaranteed to be around even when use_mm > is entered. So the onus of proper lifetime rules really is on the > caller, not on use_mm()/unuse_mm(). > > Linus >