From: SeongJae Park <sjpark@xxxxxxxxx> On Thu, 26 Aug 2021 23:42:19 +0200 David Hildenbrand <david@xxxxxxxxxx> wrote: > On 26.08.21 19:29, SeongJae Park wrote: > > From: SeongJae Park <sjpark@xxxxxxxxx> > > > > Hello David, > > > > > > On Thu, 26 Aug 2021 16:09:23 +0200 David Hildenbrand <david@xxxxxxxxxx> wrote: > > > >>> +static void damon_va_mkold(struct mm_struct *mm, unsigned long addr) > >>> +{ > >>> + pte_t *pte = NULL; > >>> + pmd_t *pmd = NULL; > >>> + spinlock_t *ptl; > >>> + > >> > >> I just stumbled over this, sorry for the dumb questions: > > > > Appreciate for the great questions! > > > >> > >> > >> a) What do we know about that region we are messing with? > >> > >> AFAIU, just like follow_pte() and follow_pfn(), follow_invalidate_pte() > >> should only be called on VM_IO and raw VM_PFNMAP mappings in general > >> (see the doc of follow_pte()). Do you even know that it's within a > >> single VMA and that there are no concurrent modifications? > > > > We have no idea about the region at this moment. However, if we successfully > > get the pte or pmd under the protection of the page table lock, we ensure the > > page for the pte or pmd is a online LRU-page with damon_get_page(), before > > updating the pte or pmd's PAGE_ACCESSED bit. We release the page table lock > > only after the update. > > > > And concurrent VMA change doesn't matter here because we read and write only > > the page table. If the address is not mapped or not backed by LRU pages, we > > simply treat it as not accessed. > > reading/writing page tables is the real problem. > > > > >> > >> b) Which locks are we holding? > >> > >> I hope we're holding the mmap lock in read mode at least. Or how are you > >> making sure there are no concurrent modifications to page tables / VMA > >> layout ... ? > >> > >>> + if (follow_invalidate_pte(mm, addr, NULL, &pte, &pmd, &ptl)) > > > > All the operations are protected by the page table lock of the pte or pmd, so > > no concurrent page table modification would happen. As previously mentioned, > > because we read and update only page table, we don't care about VMAs and > > therefore we don't need to hold mmap lock here. > > See below, that's unfortunately not sufficient. > > > > > Outside of this function, DAMON reads the VMAs to know which address ranges are > > not mapped, and avoid inefficiently checking access to the area with the > > information. Nevertheless, it happens only occasionally (once per 60 seconds > > by default), and it holds the mmap read lock in the case. > > > > Nonetheless, I agree the usage of follow_invalidate_pte() here could make > > readers very confusing. It would be better to implement and use DAMON's own > > page table walk logic. Of course, I might missing something important. If you > > think so, please don't hesitate at yelling to me. > > > I'm certainly not going to yell :) But unfortunately I'll have to tell > you that what you are doing is in my understanding fundamentally broken. > > See, page tables might get removed any time > a) By munmap() code even while holding the mmap semaphore in read (!) > b) By khugepaged holding the mmap lock in write mode > > The rules are (ignoring the rmap side of things) > > a) You can walk page tables inside a known VMA with the mmap semaphore > held in read mode. If you drop the mmap sem, you have to re-validate the > VMA! Anything could have changed in the meantime. This is essentially > what mm/pagewalk.c does. > > b) You can walk page tables ignoring VMAs with the mmap semaphore held > in write mode. > > c) You can walk page tables lockless if the architecture supports it and > you have interrupts disabled the hole time. But you are not allowed to > write. > > With what you're doing, you might end up reading random garbage as page > table pointers, or writing random garbage to pages that are no longer > used as page tables. > > Take a look at mm/gup.c:lockless_pages_from_mm() to see how difficult it > is to walk page tables lockless. And it only works because page table > freeing code synchronizes either via IPI or fake-rcu before actually > freeing a page table. > > follow_invalidate_pte() is, in general, the wrong thing to use. It's > specialized to VM_IO and VM_PFNMAP. take a look at the difference in > complexity between follow_invalidate_pte() and mm/pagewalk.c! > > I'm really sorry, but as far as I can tell, this is locking-wise broken > and follow_invalidate_pte() is the wrong interface to use here. > > Someone can most certainly correct me if I'm wrong, or if I'm missing > something regarding your implementation, but if you take a look around, > you won't find any code walking page tables without at least holding the > mmap sem in read mode -- for a good reason. Thank you very much for this kind explanation, David! I will send a patch for this soon. Thanks, SJ > > -- > Thanks, > > David / dhildenb >