Jerome Glisse <j.glisse@xxxxxxxxx> writes:

> [ text/plain ]
> On Wed, Mar 23, 2016 at 12:22:23PM +0530, Aneesh Kumar K.V wrote:
>> Jérôme Glisse <jglisse@xxxxxxxxxx> writes:
>>
>> > [ text/plain ]
>> > This patch adds helpers for device page faults. These helpers fill
>> > the mirror page table using the CPU page table, synchronizing with
>> > any update to the CPU page table.
>> >
>> > Changed since v1:
>> >   - Add comment about directory lock.
>> >
>> > Changed since v2:
>> >   - Check for mirror->hmm in hmm_mirror_fault().
>> >
>> > Changed since v3:
>> >   - Adapt to HMM page table changes.
>> >
>> > Changed since v4:
>> >   - Fix PROT_NONE, i.e. do not populate from protnone ptes.
>> >   - Fix huge pmd handling (start address may != pmd start address).
>> >   - Fix missing entry case.
>> >
>> > Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
>> > Signed-off-by: Sherry Cheung <SCheung@xxxxxxxxxx>
>> > Signed-off-by: Subhash Gutti <sgutti@xxxxxxxxxx>
>> > Signed-off-by: Mark Hairgrove <mhairgrove@xxxxxxxxxx>
>> > Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
>> > Signed-off-by: Jatin Kumar <jakumar@xxxxxxxxxx>
>> > ---
>>
>> ....
>> ....
>>
>> > +static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
>> > +                                 struct hmm_event *event,
>> > +                                 struct vm_area_struct *vma,
>> > +                                 struct hmm_pt_iter *iter,
>> > +                                 pmd_t *pmdp,
>> > +                                 struct hmm_mirror_fault *mirror_fault,
>> > +                                 unsigned long start,
>> > +                                 unsigned long end)
>> > +{
>> > +        struct page *page;
>> > +        unsigned long addr, pfn;
>> > +        unsigned flags = FOLL_TOUCH;
>> > +        spinlock_t *ptl;
>> > +        int ret;
>> > +
>> > +        ptl = pmd_lock(mirror->hmm->mm, pmdp);
>> > +        if (unlikely(!pmd_trans_huge(*pmdp))) {
>> > +                spin_unlock(ptl);
>> > +                return -EAGAIN;
>> > +        }
>> > +        flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
>> > +        page = follow_trans_huge_pmd(vma, start, pmdp, flags);
>> > +        pfn = page_to_pfn(page);
>> > +        spin_unlock(ptl);
>> > +
>> > +        /* Just fault in the whole PMD. */
>> > +        start &= PMD_MASK;
>> > +        end = start + PMD_SIZE - 1;
>> > +
>> > +        if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
>> > +                return -ENOENT;
>> > +
>> > +        for (ret = 0, addr = start; !ret && addr < end;) {
>> > +                unsigned long i, next = end;
>> > +                dma_addr_t *hmm_pte;
>> > +
>> > +                hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
>> > +                if (!hmm_pte)
>> > +                        return -ENOMEM;
>> > +
>> > +                i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
>> > +
>> > +                /*
>> > +                 * The directory lock protects against concurrent
>> > +                 * clearing of page table bit flags. Exceptions being
>> > +                 * the dirty bit and the device driver private flags.
>> > +                 */
>> > +                hmm_pt_iter_directory_lock(iter);
>> > +                do {
>> > +                        if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
>> > +                                hmm_pte[i] = hmm_pte_from_pfn(pfn);
>> > +                                hmm_pt_iter_directory_ref(iter);
>>
>> I looked at that and it is actually:
>>
>> static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
>> {
>>         BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
>>         hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
>> }
>>
>> static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
>>                                         struct page *ptd)
>> {
>>         if (!atomic_inc_not_zero(&ptd->_mapcount))
>>                 /* Illegal, this should not happen. */
>>                 BUG();
>> }
>>
>> What is the mapcount update about?
>
> Unlike the regular CPU page table, we do not rely on unmap to prune the
> HMM mirror page table. Rather, we free/prune it aggressively once the
> device no longer has anything mirrored in a given range.

Which patch does this?

> As such, the mapcount is used to keep track of how many valid entries
> there are per directory.
>
> Moreover, the mapcount is also used to protect against concurrent
> pruning: when you walk through the page table you increment the
> refcount by one along the way, and when you are done walking you
> decrement it.
>
> Because of that last aspect, the mapcount can never reach zero merely
> because we unmap pages; it can only reach zero once we clean up after
> the page table walk.
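To make that lifetime rule concrete, here is a minimal, self-contained
userspace sketch of the same scheme; it is an illustration only, not code
from the patchset. The names (mirror_dir, dir_ref, dir_unref) are
hypothetical, and C11 atomics stand in for the kernel's atomic_t helpers.
A directory's refcount counts valid entries plus active walkers, so it can
only drop to zero, and be freed, once the last walker is done:

/*
 * Illustrative sketch of refcount-guarded directory lifetime.
 * refcount = number of valid entries + number of active walkers.
 */
#include <stdatomic.h>
#include <stdlib.h>
#include <assert.h>

struct mirror_dir {
        atomic_int refcount;    /* valid entries + active walkers */
        /* ... the directory's entries would live here ... */
};

/* A new directory starts with one reference owned by its creator/walker. */
static struct mirror_dir *dir_alloc(void)
{
        struct mirror_dir *dir = calloc(1, sizeof(*dir));

        atomic_init(&dir->refcount, 1);
        return dir;
}

/*
 * Taking a reference is only legal while the count is non-zero; this
 * mirrors the atomic_inc_not_zero() + BUG() pairing quoted above.
 */
static void dir_ref(struct mirror_dir *dir)
{
        int old = atomic_load(&dir->refcount);

        do {
                assert(old != 0);       /* illegal, should not happen */
        } while (!atomic_compare_exchange_weak(&dir->refcount,
                                               &old, old + 1));
}

/*
 * Dropping the last reference frees the directory immediately: this is
 * the aggressive pruning, with no separate unmap pass needed.
 */
static void dir_unref(struct mirror_dir *dir)
{
        if (atomic_fetch_sub(&dir->refcount, 1) == 1)
                free(dir);
}

int main(void)
{
        struct mirror_dir *dir = dir_alloc();   /* walker's reference */

        dir_ref(dir);   /* entry 0 becomes valid */
        dir_ref(dir);   /* entry 1 becomes valid */

        dir_unref(dir); /* entry 0 invalidated */
        dir_unref(dir); /* entry 1 invalidated: count is 1, not 0,
                           because the walker still holds a reference */

        dir_unref(dir); /* walker finishes: count hits 0, dir is freed */
        return 0;
}

The assert in dir_ref() plays the role of the BUG() above: taking a new
reference on a directory whose count already reached zero would mean the
directory could be freed out from under the caller.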
>> > +                        }
>> > +                        BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
>> > +                        if (pmd_write(*pmdp))
>> > +                                hmm_pte_set_write(&hmm_pte[i]);
>> > +                } while (addr += PAGE_SIZE, pfn++, i++, addr != next);
>> > +                hmm_pt_iter_directory_unlock(iter);
>> > +                mirror_fault->addr = addr;
>> > +        }
>> > +
>>
>> So we don't have huge page mappings in the HMM page table?
>
> No, we don't right now. The first reason is that I wanted to keep things
> simple for device drivers. The second motivation is to keep the first
> patchset simpler, especially the page migration code.
>
> Memory overhead is 2MB per GB of virtual memory mirrored (one 8-byte
> entry per 4KB page). There is no TLB here. I believe adding huge pages
> can be done as part of a later patchset if it makes sense.
>

One of the things I am wondering is whether we can structure the patch
series in such a way that the page table mirror moves into the device
driver. That is, an HMM fault would look at the CPU page table and call
into a device driver callback with the pte entry details; it would be up
to the device driver to maintain a mirror table if it needs one.
Similarly, for a CPU fault we would call into an HMM callback to find the
per-pte dma_addr and do a migration using a copy_from_device callback.

I haven't fully looked at how easy this would be, but I guess a lot of
the code in this series has to do with the mirror table, and I am
wondering whether there is a simpler version we could get upstream that
hides it within a driver. Also, would it simplify the interfaces to
operate on one pte at a time rather than on an array of ptes?

-aneesh
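Purely as an illustration of the alternative suggested above (nothing like
this exists in the posted series), a per-pte callback table might look as
follows. The names hmm_dev_ops, update_pte, lookup_dma, and
copy_from_device are all hypothetical, and dma_addr_t is stubbed out so
the snippet compiles as plain C:

/*
 * Hypothetical sketch of a callback-based interface where the core never
 * keeps a mirror page table: it walks the CPU page table and calls into
 * the driver per pte. None of these names come from the posted series.
 */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t dma_addr_t;    /* stand-in for the kernel type */

struct hmm_dev_ops {
        /*
         * Device fault: called once per pte the core resolved from the
         * CPU page table. The driver updates its private mirror table,
         * if it keeps one, and programs its own MMU.
         */
        int (*update_pte)(void *dev_private, unsigned long addr,
                          unsigned long pfn, int write);

        /*
         * CPU fault on device-resident memory: report the dma_addr
         * backing addr so the core can migrate the page back.
         */
        int (*lookup_dma)(void *dev_private, unsigned long addr,
                          dma_addr_t *dma);

        /* Copy len bytes from device memory at src into dst. */
        int (*copy_from_device)(void *dev_private, dma_addr_t src,
                                void *dst, size_t len);
};

Whether such callbacks should take one pte or a batch is exactly the
trade-off raised at the end of the mail: per-pte callbacks keep the
contract simple but cost an indirect call per page, while an array
interface amortizes that cost at the price of a more complex interface.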