On 2021-07-01 at 4:15 p.m., Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie <airlied@xxxxxxxxx> wrote:
>> Hi Linus,
>>
>> This is the main drm pull request for 5.14-rc1.
>>
>> I've done a test pull into your current tree, and hit two conflicts
>> (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one
>> is recent and sfr sent out a resolution for it today.
>
> Well, the resolutions may be trivial, but the conflict made me look at
> the code, and it's buggy.
>
> Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function")
> is broken. It made the code do
>
>         mmap_read_lock(mm);
>         vma = find_vma(mm, start);
>         mmap_read_unlock(mm);
>
> and then it *uses* that "vma" after it has dropped the lock.
>
> That's a big no-no - once you've dropped the lock, the vma contents
> simply aren't reliable any more. That mapping could now be unmapped
> and removed at any time.
>
> Now, the conflict actually made one of the uses go away (switching to
> vma_lookup() means that the subsequent code no longer needs to look at
> "vm_start" to verify we're actually _inside_ the vma), but it still
> checks for vma->vm_file afterwards.
>
> So those locking changes in commit 04d8d73dbcbe are completely bogus.
>
> I tried to fix up that bug while handling the conflict, but who knows
> what else similar is going on elsewhere.
>
> So I would ask people to
>
>  (a) verify that I didn't make things worse as I fixed things up (note
> how I had to change the last argument to amdgpu_hmm_range_get_pages()
> from false to true etc).
>
>  (b) go and look at their vma lookup code: you can't just look up a
> vma under the lock, and then drop the lock, and then think things stay
> stable.
>
> In particular for that (b) case: it is *NOT* enough to look up
> vma->vm_file inside the lock and cache that. No - if the test is about
> "no backing file before looking up pages", then you have to *keep*
> holding the lock until after you've actually looked up the pages!
>
> Because otherwise any test for "vma->vm_file" is entirely pointless,
> for the same reason it's buggy to even look at it after dropping the
> lock: because once you've dropped the lock, the thing you just tested
> for might not be true any more.
>
> So no, it's not valid to do
>
>         bool has_file = vma && vma->vm_file;
>
> and then drop the lock, because you don't use 'vma' any more as a
> pointer, and then use 'has_file' outside the lock. Because after
> you've dropped the lock, 'has_file' is now meaningless.
>
> So it's not just about "you can't look at vma->vm_file after dropping
> the lock". It's more fundamental than that. Any *decision* you make
> based on the vma is entirely pointless and moot after the lock is
> dropped!
>
> Did I fix it up correctly? Who knows. The code makes more sense to me
> now and seems valid. But I really *really* want to stress how locking
> is important.

Thank you for the fix and the explanation. Your fix looks correct. I
also double-checked all other uses of find_vma in the amdgpu driver;
they all hold the mmap lock correctly.

Two comments:

1. With this fix, we could remove the bool mmap_locked parameter from
   amdgpu_hmm_range_get_pages, because it is now always called with the
   lock held.

2. You're now holding the mmap lock from the vma_lookup until
   hmm_range_fault is done, which ensures that the result of the
   vma->vm_file check remains valid. This was broken even before our
   commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages
   function").
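To spell out what comment 2 means in code, here is a minimal sketch of
the safe pattern. The function name, the pre-populated hmm_range, and
the -EFAULT policy are made up for illustration; this is not the actual
amdgpu code:

    #include <linux/mm.h>
    #include <linux/hmm.h>

    /*
     * Hold the mmap lock from the vma lookup until hmm_range_fault()
     * returns, so the vma->vm_file check cannot be invalidated by a
     * concurrent unmap. Retry handling for -EBUSY is omitted.
     */
    static int sketch_get_pages(struct mm_struct *mm,
                                struct hmm_range *range,
                                unsigned long start)
    {
            struct vm_area_struct *vma;
            int r;

            mmap_read_lock(mm);
            vma = vma_lookup(mm, start);
            if (!vma || vma->vm_file) {
                    /* reject file-backed mappings before faulting pages */
                    r = -EFAULT;
                    goto out_unlock;
            }
            /* the check above is still meaningful: the lock is still held */
            r = hmm_range_fault(range);
    out_unlock:
            mmap_read_unlock(mm);
            return r;
    }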
>
> You also can't just unlock in the middle of an operation - even if you
> then take the lock *again* later (as amdgpu_hmm_range_get_pages() then
> did), the fact that you unlocked in the middle means that all the
> earlier tests you did are simply no longer valid when you re-take the
> lock.

I agree completely. I catch a lot of locking bugs in code review, but I
probably missed this one because I wasn't paying enough attention to
what the mmap_read_lock was actually protecting in this case. (A sketch
of that broken unlock/re-lock pattern follows at the end of this mail,
for reference.)

Regards,
  Felix

>
>                 Linus
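P.S.: The unlock-in-the-middle anti-pattern looks roughly like this
(names and structure are illustrative, not the actual pre-fix driver
code):

    /*
     * BROKEN: every conclusion drawn under the first lock hold is
     * stale once the lock is dropped, even though the lock is
     * re-taken before hmm_range_fault().
     */
    static int sketch_broken_get_pages(struct mm_struct *mm,
                                       struct hmm_range *range,
                                       unsigned long start)
    {
            struct vm_area_struct *vma;
            bool valid;
            int r = -EFAULT;

            mmap_read_lock(mm);
            vma = vma_lookup(mm, start);
            valid = vma && !vma->vm_file;   /* decision made under the lock */
            mmap_read_unlock(mm);           /* BUG: 'valid' goes stale here */

            /* the mapping can be unmapped or replaced at this point */

            mmap_read_lock(mm);
            if (valid)                      /* tests a stale snapshot */
                    r = hmm_range_fault(range);
            mmap_read_unlock(mm);
            return r;
    }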