Le 05/03/2016 à 12:11 PM, Jerome Glisse
a écrit :
On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:Hi everyone, This is a repost from a different address as it seems the previous one ended in Gmail junk due to a domain error..linux-kernel is a very high volume list which few are reading: that also will account for your lack of response so far (apart from the indefatigable Alan). I've added linux-mm, and some people from another thread regarding THP and get_user_pages() pins which has been discussed in recent days. Make no mistake, the issue you're raising here is definitely not the same as that one (which is specifically about the new THP refcounting in v4.5+, whereas you're reporting a problem you've seen in both a v3.10-based kernel and in v4.5). But I think their heads are in gear, much more so than mine, and likely to spot something.I added more info found while blindly debugging the issue. Short version: I'm having an issue with direct DMA transfer from a device to host memory. It seems some of the data is not transferring to the appropriate page. Some more details: I'm debugging a home made PCI driver for our board (Kalray), attached to a x86_64 host running centos7 (3.10.0-327.el7.x86_64) In the current case, a userland application transfers back and forth data through read/write operations on a file. On the kernel side, it triggers DMA transfers through the PCI to/from our board memory. We followed what pretty much all docs said about direct I/O to user buffers: 1) get_user_pages() (in the current case, it's at most 16 pages at once) 2) convert to a scatterlist 3) pci_map_sg 4) eventually coalesce sg (Intel IOMMU is enabled, so it's usually possible) 4) A lot of DMA engine handling code, using the dmaengine layer and virt-dma 5) wait for transfer complete, in the mean time, go back to (1) to schedule more work, if any 6) pci_unmap_sg 7) for read (card2host) transfer, set_page_dirty_lock 8) page_cache_release In 99,9999% it works perfectly. However, I have one userland application where a few pages are not written by a read (card2host) transfer. The buffer is memset them to a different value so I can check that nothing has overwritten them. I know (PCI protocol analyser) that the data left our board for the "right" address (the one set in the sg by pci_map_sg). I tried reading the data between the pci_unmap_sg and the set_page_dirty, using uint32_t *addr = page_address(trans->pages[0]); dev_warn(&pdata->pdev->dev, "val = %x\n", *addr); and it has the expected value. But if I try to copy_from_user (using the address coming from userland, the one passed to get_user_pages), the data has not been written and I see the memset value. New infos: The issue happens with IOMMU on or off. I compiled a kernel with DMA_API_DEBUG enabled and got no warnings or errors. I digged a little bit deeper with my very small understanding of linux mm and I discovered that: * we are using transparent huge pages * the page 'not transferred' are the last few of a huge page More precisely: - We have several transfer in flight from the same user buffer - Each transfer is 16 pages long - At one point in time, we start transferring from another huge page (transfers are still in flight from the previous one) - When a transfer from the previous huge page completes, I dumped at the mapcount of the pages from the previous transfers, they are all to 0. The pages are still mapped to dma at this point. - A get_user_page to the address of the completed transfer returns return a different struct page * then the on I had. But this is before I have unmapped/put_page them back. From my understanding this should not have happened. I tried the same code with a kernel 4.5 and encountered the same issue Disabling transparent huge pages makes the issue disapear Thanks in advanceIt does look to me as if pages are being migrated, despite being pinned by get_user_pages(): and that would be wrong. Originally I intended to suggest that THP is probably merely the cause of compaction, with compaction causing the page migration. But you posted very interesting details in an earlier mail on 27th April from <nmorey@xxxxxxxxx>:I ran some more tests: * Test is OK if transparent huge tlb are disabled * For all the page where data are not transfered, and only those pages, a call to get_user_page(user vaddr) just before dma_unmap_sg returns a different page from the original one. [436477.927279] mppa 0000:03:00.0: org_page= ffffea0009f60080 cur page = ffffea00074e0080 [436477.927298] page:ffffea0009f60080 count:0 mapcount:1 mapping: (null) index:0x2 [436477.927314] page flags: 0x2fffff00008000(tail) [436477.927354] page dumped because: org_page [436477.927369] page:ffffea00074e0080 count:0 mapcount:1 mapping: (null) index:0x2 [436477.927382] page flags: 0x2fffff00008000(tail) [436477.927421] page dumped because: cur_page I'm not sure what to make of this...That (on the older kernel I think) seems clearly to show that a THP itself has been migrated: which makes me suspect NUMA migration of mispaced THPs - migrate_misplaced_transhuge_page(). I'd hoped to find something obviously wrong there, but haven't quite managed to bring my brain fully to bear on it, and hope the others Cc'ed will do so more quickly (or spot the error of your ways instead). I do find it suspect, how the migrate_page_copy() is done rather early, while the old page is still mapped in the pagetable. And odd how it inserts the new pmd for a moment, before checking old page_count and backing out. But I don't see how either of those would cause the trouble you see, where the migration goes ahead.So i do not think there is a bug migrate_misplaced_transhuge_page() but i think something is wrong in it see attached patch. I still want to convince myself i am not missing anything before posting that one. Now about this bug, dumb question but do you do get_user_pages with write = 1 because if your device is writting to the page then you must set write to 1. get_user_pages(vaddr, nrpages, 1, 0|1, pages, NULL|vmas); Cheers, Jérôme Hi, I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working ! I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7. I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system? Thanks a lot Nicolas |