On 2021-07-01 at 4:15 p.m., Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie <airlied@xxxxxxxxx> wrote:
>> Hi Linus,
>>
>> This is the main drm pull request for 5.14-rc1.
>>
>> I've done a test pull into your current tree, and hit two conflicts
>> (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one
>> is recent and sfr sent out a resolution for it today.
>
> Well, the resolutions may be trivial, but the conflict made me look at
> the code, and it's buggy.
>
> Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function")
> is broken. It made the code do
>
>         mmap_read_lock(mm);
>         vma = find_vma(mm, start);
>         mmap_read_unlock(mm);
>
> and then it *uses* that "vma" after it has dropped the lock.
>
> That's a big no-no - once you've dropped the lock, the vma contents
> simply aren't reliable any more. That mapping could now be unmapped
> and removed at any time.
>
> Now, the conflict actually made one of the uses go away (switching to
> vma_lookup() means that the subsequent code no longer needs to look at
> "vm_start" to verify we're actually _inside_ the vma), but it still
> checks for vma->vm_file afterwards.
>
> So those locking changes in commit 04d8d73dbcbe are completely bogus.
>
> I tried to fix up that bug while handling the conflict, but who knows
> what else similar is going on elsewhere.
>
> So I would ask people to
>
>  (a) verify that I didn't make things worse as I fixed things up (note
> how I had to change the last argument to amdgpu_hmm_range_get_pages()
> from false to true etc).
>
>  (b) go and look at their vma lookup code: you can't just look up a
> vma under the lock, and then drop the lock, and then think things stay
> stable.
>
> In particular for that (b) case: it is *NOT* enough to look up
> vma->vm_file inside the lock and cache that. No - if the test is about
> "no backing file before looking up pages", then you have to *keep*
> holding the lock until after you've actually looked up the pages!
>
> Because otherwise any test for "vma->vm_file" is entirely pointless,
> for the same reason it's buggy to even look at it after dropping the
> lock: because once you've dropped the lock, the thing you just tested
> for might not be true any more.
>
> So no, it's not valid to do
>
>         bool has_file = vma && vma->vm_file;
>
> and then drop the lock, because you don't use 'vma' any more as a
> pointer, and then use 'has_file' outside the lock. Because after
> you've dropped the lock, 'has_file' is now meaningless.
>
> So it's not just about "you can't look at vma->vm_file after dropping
> the lock". It's more fundamental than that. Any *decision* you make
> based on the vma is entirely pointless and moot after the lock is
> dropped!
>
> Did I fix it up correctly? Who knows. The code makes more sense to me
> now and seems valid. But I really *really* want to stress how locking
> is important.

Thank you for the fix and the explanation. Your fix looks correct. I
also double-checked all other uses of find_vma in the amdgpu driver;
they all hold the mmap lock correctly.

Two comments:

1. With this fix, we could remove the bool mmap_locked parameter from
   amdgpu_hmm_range_get_pages, because it is now always called with the
   lock held.

2. You're now holding the mmap lock from the vma_lookup until
   hmm_range_fault is done, which ensures that the result of the
   vma->vm_file check remains valid. This was broken even before our
   commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages
   function").
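To spell out what comment 2 means in code, here is a minimal sketch of
the safe pattern. The function name, the pre-populated hmm_range, and
the -EFAULT policy are made up for illustration; this is not the actual
amdgpu code:

    #include <linux/mm.h>
    #include <linux/hmm.h>

    /*
     * Hold the mmap lock from the vma lookup until hmm_range_fault()
     * returns, so the vma->vm_file check cannot be invalidated by a
     * concurrent unmap. Retry handling for -EBUSY is omitted.
     */
    static int sketch_get_pages(struct mm_struct *mm,
                                struct hmm_range *range,
                                unsigned long start)
    {
            struct vm_area_struct *vma;
            int r;

            mmap_read_lock(mm);
            vma = vma_lookup(mm, start);
            if (!vma || vma->vm_file) {
                    /* reject file-backed mappings before faulting pages */
                    r = -EFAULT;
                    goto out_unlock;
            }
            /* the check above is still meaningful: the lock is still held */
            r = hmm_range_fault(range);
    out_unlock:
            mmap_read_unlock(mm);
            return r;
    }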
>
> You also can't just unlock in the middle of an operation - even if you
> then take the lock *again* later (as amdgpu_hmm_range_get_pages() then
> did), the fact that you unlocked in the middle means that all the
> earlier tests you did are simply no longer valid when you re-take the
> lock.

I agree completely. I catch a lot of locking bugs in code review, but I
probably missed this one because I wasn't paying enough attention to
what the mmap_read_lock was actually protecting in this case. (A sketch
of that broken unlock/re-lock pattern follows at the end of this mail,
for reference.)

Regards,
  Felix

>
>                 Linus
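P.S.: The unlock-in-the-middle anti-pattern looks roughly like this
(names and structure are illustrative, not the actual pre-fix driver
code):

    /*
     * BROKEN: every conclusion drawn under the first lock hold is
     * stale once the lock is dropped, even though the lock is
     * re-taken before hmm_range_fault().
     */
    static int sketch_broken_get_pages(struct mm_struct *mm,
                                       struct hmm_range *range,
                                       unsigned long start)
    {
            struct vm_area_struct *vma;
            bool valid;
            int r = -EFAULT;

            mmap_read_lock(mm);
            vma = vma_lookup(mm, start);
            valid = vma && !vma->vm_file;   /* decision made under the lock */
            mmap_read_unlock(mm);           /* BUG: 'valid' goes stale here */

            /* the mapping can be unmapped or replaced at this point */

            mmap_read_lock(mm);
            if (valid)                      /* tests a stale snapshot */
                    r = hmm_range_fault(range);
            mmap_read_unlock(mm);
            return r;
    }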