On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie <airlied@xxxxxxxxx> wrote: > > Hi Linus, > > This is the main drm pull request for 5.14-rc1. > > I've done a test pull into your current tree, and hit two conflicts > (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one > is recent and sfr sent out a resolution for it today. Well, the resolutions may be trivial, but the conflict made me look at the code, and it's buggy. Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function") is broken. It made the code do mmap_read_lock(mm); vma = find_vma(mm, start); mmap_read_unlock(mm); and then it *uses* that "vma" after it has dropped the lock. That's a big no-no - once you've dropped the lock, the vma contents simply aren't reliable any more. That mapping could now be unmapped and removed at any time. Now, the conflict actually made one of the uses go away (switching to vma_lookup() means that the subsequent code no longer needs to look at "vm_start" to verify we're actually _inside_ the vma), but it still checks for vma->vm_file afterwards. So those locking changes in commit 04d8d73dbcbe are completely bogus. I tried to fix up that bug while handling the conflict, but who knows what else similar is going on elsewhere. So I would ask people to (a) verify that I didn't make things worse as I fixed things up (note how I had to change the last argument to amdgpu_hmm_range_get_pages() from false to true etc). (b) go and look at their vma lookup code: you can't just look up a vma under the lock, and then drop the lock, and then think things stay stable. In particular for that (b) case: it is *NOT* enough to look up vma->vm_file inside the lock and cache that. No - if the test is about "no backing file before looking up pages", then you have to *keep* holding the lock until after you've actually looked up the pages! Because otherwise any test for "vma->vm_file" is entirely pointless, for the same reason it's buggy to even look at it after dropping the lock: because once you've dropped the lock, the thing you just tested for might not be true any more. So no, it's not valid to do bool has_file = vma && vma->vm_file; and then drop the lock, because you don't use 'vma' any more as a pointer, and then use 'has_file' outside the lock. Because after you've dropped the lock, 'has_file' is now meaningless. So it's not just about "you can't look at vma->vm_file after dropping the lock". It's more fundamental than that. Any *decision* you make based on the vma is entirely pointless and moot after the lock is dropped! Did I fix it up correctly? Who knows. The code makes more sense to me now and seems valid. But I really *really* want to stress how locking is important. You also can't just unlock in the middle of an operation - even if you then take the lock *again* later (as amdgpu_hmm_range_get_pages() then did), the fact that you unlocked in the middle means that all the earlier tests you did are simply no longer valid when you re-take the lock. Linus