Possible use_mm() mis-uses

christian.koenig@xxxxxxx (Christian König) · Wed, 22 Aug 2018 20:13:39 +0200

Adding Felix because the KFD part of amdgpu is actually his responsibility.

If I'm not completely mistaken the release callback of the mmu_notifier 
should take care of that for amdgpu.

Regards,
Christian.

Am 22.08.2018 um 18:44 schrieb Linus Torvalds:
> Guys and gals,
>   this is a *very* random list of people on the recipients list, but we
> had a subtle TLB shootdown issue in the VM, and that brought up some
> issues when people then went through the code more carefully.
>
> I think we have a handle on the TLB shootdown bug itself. But when
> people were discussing all the possible situations, one thing that
> came up was "use_mm()" that takes a mm, and makes it temporarily the
> mm for a kernel thread (until "unuse_mm()", duh).
>
> And it turns out that some of those uses are definitely wrong, some of
> them are right, and some of them are suspect or at least so overly
> complicated that it's hard for the VM people to know if they are ok.
>
> Basically, the rule for "use_mm()" is that the mm in question *has* to
> have a valid page table associated with it over the whole use_mm() ->
> unuse_mm() sequence. That may sound obvious, and I guess it actually
> is so obvious that there isn't even a comment about it, but the actual
> users are showing that it's sadly apparently not so obvious after all.
>
> There is one user that uses the "obviously correct" model: the vhost
> driver does a "mmget()" and "mmput()" pair around its use of it,
> thanks to vhost_dev_set_owner() doing a
>
>          dev->mm = get_task_mm(current);
>
> to look up the mm, and then the teardown case does a
>
>          if (dev->mm)
>                  mmput(dev->mm);
>          dev->mm = NULL;
>
> This is the *right* sequence. A gold star to the vhost people.
>
> Sadly, the vhost people are the only ones who seem to get things
> unquestionably right. And some of those gold star people are also
> apparently involved in the cases that didn't get things right.
>
> An example of something that *isn't* right, is the i915 kvm interface,
> which does
>
>          use_mm(kvm->mm);
>
> on an mm that was initialized in virt/kvm/kvm_main.c using
>
>          mmgrab(current->mm);
>          kvm->mm = current->mm;
>
> which is *not* right. Using "mmgrab()" does indeed guarantee the
> lifetime of the 'struct mm_struct' itself, but it does *not* guarantee
> the lifetime of the page tables. You need to use "mmget()" and
> "mmput()", which get the reference to the actual process address
> space!
>
> Now, it is *possible* that the kvm use is correct too, because kvm
> does register a mmu_notifier chain, and in theory you can avoid the
> proper refcounting by just making sure the mmu "release" notifier
> kills any existing uses, but I don't really see kvm doing that. Kvm
> does register a release notifier, but that just flushes the shadow
> page tables, it doesn't kill any use_mm() use by some i915 use case.
>
> So while the vhost use looks right, the kvm/i915 use looks definitely wrong.
>
> The other users of "use_mm()" and "unuse_mm()" are less
> black-and-white right vs wrong..
>
> One of the complex ones is the amdgpu driver. It does a
> "use_mm(mmptr)" deep deep in the guts of a macro that ends up being
> used in fa few places, and it's very hard to tell if it's right.
>
> It looks almost certainly buggy (there is no mmget/mmput to get the
> refcount), but there _is_ a "release" mmu_notifier function and that
> one - unlike the kvm case - looks like it might actually be trying to
> flush the existing pending users of that mm.
>
> But on the whole, I'm suspicious of the amdgpu use. It smells. Jann
> Horn pointed out that even if it migth be ok due to the mmu_notifier,
> the comments are garbage:
>
>>   Where "process" in the uniquely-named "struct queue" is a "struct
>>   kfd_process"; that struct's definition has this comment in it:
>>
>>         /*
>>          * Opaque pointer to mm_struct. We don't hold a reference to
>>          * it so it should never be dereferenced from here. This is
>>          * only used for looking up processes by their mm.
>>          */
>>         void *mm;
>>
>>   So I think either that comment is wrong, or their code is wrong?
> so I'm chalking the amdgpu use up in the "broken" column.
>
> It's also actually quite hard to synchronze with some other kernel
> worker thread correctly, so just on general principles, if you use
> "use_mm()" it really really should be on something that you've
> properly gotten a mm refcount on with mmget(). Really. Even if you try
> to synchronize things.
>
> The two final cases are two uses in the USB gadget driver. Again, they
> don't have the proper mmget/mmput, but they may br ok simply because
> the uses are done for AIO, and the VM teardown is preceded by an AIO
> teardown, so the proper serialization may come in from that.
>
> Anyway, sorry for the long email, and the big list of participants and
> odd mailing lists, but I'd really like people to look at their
> "use_mm()" cases, and ask themselves if they have done enough to
> guarantee that the full mm exists. Again, "mmgrab()" is *not* enough
> on its own. You need either "mmget()" or some lifetime guarantee.
>
> And if you do have those lifetime guarantees, it would be really nice
> to get a good explanatory comment about said lifetime guarantees above
> the "use_mm()" call. Ok?
>
> Note that the lifetime rules are very important, because obviously
> use_mm() itself is never called in the context of the process that has
> the mm. By definition, we're in a kernel thread and it uses somebody
> elses mm. So it's important to show that that "somebody else" really
> is serialized with the kernel thread somehow, and keeps the mm alive.
> The easiest way by far to have that guarantee is to have that
> "mmget/mmput" model.
>
> Please? Even if you're convinced your code is right, just a comment
> about _why_ it's right, ok?
>
> And no, use_mm() cannot just do the mmget internally, only to have
> unuse_mm() do the mmput().  That was our first reaction when looking
> at this, but if the caller has screwed up the lifetime rules, it's not
> actually clear that the mm is guaranteed to be around even when use_mm
> is entered. So the onus of proper lifetime rules really is on the
> caller, not on use_mm()/unuse_mm().
>
>               Linus