On Tue, May 03, 2016 at 12:11:54PM +0200, Jerome Glisse wrote: > On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote: > > On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote: > > > > > Hi everyone, > > > > > > This is a repost from a different address as it seems the previous one ended in Gmail junk due to a domain error.. > > > > linux-kernel is a very high volume list which few are reading: > > that also will account for your lack of response so far > > (apart from the indefatigable Alan). > > > > I've added linux-mm, and some people from another thread regarding > > THP and get_user_pages() pins which has been discussed in recent days. > > > > Make no mistake, the issue you're raising here is definitely not the > > same as that one (which is specifically about the new THP refcounting > > in v4.5+, whereas you're reporting a problem you've seen in both a > > v3.10-based kernel and in v4.5). But I think their heads are in > > gear, much more so than mine, and likely to spot something. > > > > > I added more info found while blindly debugging the issue. > > > > > > Short version: > > > I'm having an issue with direct DMA transfer from a device to host memory. > > > It seems some of the data is not transferring to the appropriate page. > > > > > > Some more details: > > > I'm debugging a home made PCI driver for our board (Kalray), attached to a x86_64 host running centos7 (3.10.0-327.el7.x86_64) > > > > > > In the current case, a userland application transfers back and forth data through read/write operations on a file. > > > On the kernel side, it triggers DMA transfers through the PCI to/from our board memory. > > > > > > We followed what pretty much all docs said about direct I/O to user buffers: > > > > > > 1) get_user_pages() (in the current case, it's at most 16 pages at once) > > > 2) convert to a scatterlist > > > 3) pci_map_sg > > > 4) eventually coalesce sg (Intel IOMMU is enabled, so it's usually possible) > > > 4) A lot of DMA engine handling code, using the dmaengine layer and virt-dma > > > 5) wait for transfer complete, in the mean time, go back to (1) to schedule more work, if any > > > 6) pci_unmap_sg > > > 7) for read (card2host) transfer, set_page_dirty_lock > > > 8) page_cache_release > > > > > > In 99,9999% it works perfectly. > > > However, I have one userland application where a few pages are not written by a read (card2host) transfer. > > > The buffer is memset them to a different value so I can check that nothing has overwritten them. > > > > > > I know (PCI protocol analyser) that the data left our board for the "right" address (the one set in the sg by pci_map_sg). > > > I tried reading the data between the pci_unmap_sg and the set_page_dirty, using > > > uint32_t *addr = page_address(trans->pages[0]); > > > dev_warn(&pdata->pdev->dev, "val = %x\n", *addr); > > > and it has the expected value. > > > But if I try to copy_from_user (using the address coming from userland, the one passed to get_user_pages), the data has not been written and I see the memset value. > > > > > > New infos: > > > > > > The issue happens with IOMMU on or off. > > > I compiled a kernel with DMA_API_DEBUG enabled and got no warnings or errors. > > > > > > I digged a little bit deeper with my very small understanding of linux mm and I discovered that: > > > * we are using transparent huge pages > > > * the page 'not transferred' are the last few of a huge page > > > More precisely: > > > - We have several transfer in flight from the same user buffer > > > - Each transfer is 16 pages long > > > - At one point in time, we start transferring from another huge page (transfers are still in flight from the previous one) > > > - When a transfer from the previous huge page completes, I dumped at the mapcount of the pages from the previous transfers, > > > they are all to 0. The pages are still mapped to dma at this point. > > > - A get_user_page to the address of the completed transfer returns return a different struct page * then the on I had. > > > But this is before I have unmapped/put_page them back. From my understanding this should not have happened. > > > > > > I tried the same code with a kernel 4.5 and encountered the same issue > > > > > > Disabling transparent huge pages makes the issue disapear > > > > > > Thanks in advance > > > > It does look to me as if pages are being migrated, despite being pinned > > by get_user_pages(): and that would be wrong. Originally I intended > > to suggest that THP is probably merely the cause of compaction, with > > compaction causing the page migration. But you posted very interesting > > details in an earlier mail on 27th April from <nmorey@xxxxxxxxx>: > > > > > I ran some more tests: > > > > > > * Test is OK if transparent huge tlb are disabled > > > > > > * For all the page where data are not transfered, and only those pages, a call to get_user_page(user vaddr) just before dma_unmap_sg returns a different page from the original one. > > > [436477.927279] mppa 0000:03:00.0: org_page= ffffea0009f60080 cur page = ffffea00074e0080 > > > [436477.927298] page:ffffea0009f60080 count:0 mapcount:1 mapping: (null) index:0x2 > > > [436477.927314] page flags: 0x2fffff00008000(tail) > > > [436477.927354] page dumped because: org_page > > > [436477.927369] page:ffffea00074e0080 count:0 mapcount:1 mapping: (null) index:0x2 > > > [436477.927382] page flags: 0x2fffff00008000(tail) > > > [436477.927421] page dumped because: cur_page > > > > > > I'm not sure what to make of this... > > > > That (on the older kernel I think) seems clearly to show that a THP > > itself has been migrated: which makes me suspect NUMA migration of > > mispaced THPs - migrate_misplaced_transhuge_page(). I'd hoped to > > find something obviously wrong there, but haven't quite managed > > to bring my brain fully to bear on it, and hope the others Cc'ed > > will do so more quickly (or spot the error of your ways instead). > > > > I do find it suspect, how the migrate_page_copy() is done rather > > early, while the old page is still mapped in the pagetable. And > > odd how it inserts the new pmd for a moment, before checking old > > page_count and backing out. But I don't see how either of those > > would cause the trouble you see, where the migration goes ahead. > > So i do not think there is a bug migrate_misplaced_transhuge_page() > but i think something is wrong in it see attached patch. I still > want to convince myself i am not missing anything before posting > that one. > > > Now about this bug, dumb question but do you do get_user_pages with > write = 1 because if your device is writting to the page then you > must set write to 1. > > get_user_pages(vaddr, nrpages, 1, 0|1, pages, NULL|vmas); > > > Cheers, > Jérôme > From 9ded2a5da75a5e736fb36a2c4e2511d9516ecc37 Mon Sep 17 00:00:00 2001 > From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@xxxxxxxxxx> > Date: Tue, 3 May 2016 11:53:24 +0200 > Subject: [PATCH] mm/numa/thp: fix assumptions of > migrate_misplaced_transhuge_page() > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > Fix assumptions in migrate_misplaced_transhuge_page() which is only > call by do_huge_pmd_numa_page() itself only call by __handle_mm_fault() > for pmd with PROT_NONE. This means that if the pmd stays the same > then there can be no concurrent get_user_pages / get_user_pages_fast > (GUP/GUP_fast). More over because migrate_misplaced_transhuge_page() > only do something is page is map once then there can be no GUP from > a different process. Finaly, holding the pmd lock assure us that no > other part of the kernel will take an extre reference on the page. > > In the end this means that the failure code path should never be > taken unless something is horribly wrong, so convert it to BUG_ON(). > > Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx> > Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx > Cc: Mel Gorman <mgorman@xxxxxxx> > Cc: Hugh Dickins <hughd@xxxxxxxxxx> > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx> The logic looks valid to me: Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> > --- > mm/migrate.c | 31 +++++++++++++++++++++---------- > 1 file changed, 21 insertions(+), 10 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 6c822a7..6315aac 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1757,6 +1757,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, > pmd_t orig_entry; > > /* > + * What we do here is only valid if pmd_protnone(entry) is true and it > + * is map in only one vma numamigrate_isolate_page() takes care of that > + * check. > + */ > + if (!pmd_protnone(entry)) > + goto out_unlock; > + > + /* > * Rate-limit the amount of data that is being migrated to a node. > * Optimal placement is no good if the memory bus is saturated and > * all the time is being spent migrating! > @@ -1797,7 +1805,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, > mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); > ptl = pmd_lock(mm, pmd); > if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) { > -fail_putback: > spin_unlock(ptl); > mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); > > @@ -1819,7 +1826,12 @@ fail_putback: > goto out_unlock; > } > > - orig_entry = *pmd; > + /* > + * We are holding the lock so no one can set a new pmd and original pmd > + * is PROT_NONE thus no one can get_user_pages or get_user_pages_fast > + * (GUP or GUP_fast) from this point on we can not fail. > + */ > + orig_entry = entry; > entry = mk_pmd(new_page, vma->vm_page_prot); > entry = pmd_mkhuge(entry); > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > @@ -1837,14 +1849,13 @@ fail_putback: > set_pmd_at(mm, mmun_start, pmd, entry); > update_mmu_cache_pmd(vma, address, &entry); > > - if (page_count(page) != 2) { > - set_pmd_at(mm, mmun_start, pmd, orig_entry); > - flush_pmd_tlb_range(vma, mmun_start, mmun_end); > - mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); > - update_mmu_cache_pmd(vma, address, &entry); > - page_remove_rmap(new_page, true); > - goto fail_putback; > - } > + /* As said above no one can get reference on the old page nor through > + * get_user_pages or get_user_pages_fast (GUP/GUP_fast) or through > + * any other means. To get reference on huge page you need to hold > + * pmd_lock and we are already holding that lock here and the page > + * is only mapped once. > + */ > + BUG_ON(page_count(page) != 2); > > mlock_migrate_page(new_page, page); > page_remove_rmap(page, true); > -- > 2.1.0 > -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>