Re: [BUG] vfio device assignment regression with THP ref counting redesign

On Thu, 28 Apr 2016 21:17:26 +0300
"Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> wrote:

> On Thu, Apr 28, 2016 at 10:20:51AM -0600, Alex Williamson wrote:
> > Hi,
> > 
> > vfio-based device assignment makes use of get_user_pages_fast() in order
> > to pin pages for mapping through the iommu for userspace drivers.
> > Until the recent redesign of THP reference counting in the v4.5 kernel,
> > this all worked well.  Now we're seeing cases where a sanity test
> > before we release our "pinned" mapping results in a different page
> > address than what we programmed into the iommu.  So something is
> > occurring which pretty much negates the pinning we're trying to do.
> > 
> > The test program I'm using is here:
> > 
> > https://github.com/awilliam/tests/blob/master/vfio-iommu-map-unmap.c
> > 
> > Apologies for lack of makefile, simply build with gcc -o <out> <in.c>.
> > 
> > To run this, enable the IOMMU on your system - enable in BIOS plus add
> > intel_iommu=on to the kernel commandline (only Intel x86_64 tested).
> > 
> > Pick a target PCI device, it doesn't matter what it is, the test only
> > needs a device for the purpose of creating an iommu domain, the device
> > is never actually touched.  In my case I use a spare NIC at 00:19.0.
> > libvirt tools are useful for setting this up, simply run 'virsh
> > nodedev-detach pci_0000_00_19_0'.  Otherwise bind the device manually
> > to vfio-pci using the standard new_id bind (ask, I can provide
> > instructions).
> > 
> > I also tweak THP scanning to make sure it is actively trying to
> > collapse pages:
> > 
> > echo always > /sys/kernel/mm/transparent_hugepage/defrag
> > echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> > echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> > 
> > Run the test with 'vfio-iommu-map-unmap 0000:00:19.0', or your chosen
> > target device.
> > 
> > Of course to see that the mappings are moving, we need additional
> > sanity testing in the vfio iommu driver.  For that:
> > 
> > https://github.com/awilliam/linux-vfio/commit/379f324e3629349a7486018ad1cc5d4877228d1e
> > 
> > When we map memory for vfio, we use get_user_pages_fast() on the
> > process vaddr to give us a page.  page_to_pfn() then gives us the
> > physical memory address which we program into the iommu.  Obviously we
> > expect this mapping to be stable so long as we hold the page
> > reference.  On unmap we generally retrieve the physical memory address
> > from the iommu, convert it back to a page, and release our reference to
> > it.  The debug code above adds an additional sanity test where on unmap
> > we also call get_user_pages_fast() again before we've released the
> > mapping reference and compare whether the physical page address still
> > matches what we previously stored in the iommu.  On a v4.4 kernel this
> > works every time.  On v4.5+, we get mismatches in dmesg within a few
> > lines of output from the test program.
> > 
> > It's difficult to bisect around the THP reference counting redesign
> > since THP becomes disabled for much of it.  I have discovered that this
> > commit is a significant contributor:
> > 
> > 1f25fe2 mm, thp: adjust conditions when we can reuse the page on WP fault
> > 
> > Particularly the middle chunk in huge_memory.c.  Reverting this change
> > alone significantly improves the problem, but does not lead to a stable
> > system.
> > 
> > I'm not an mm expert, so I'm looking for help debugging this.  As shown
> > above this issue is reproducible without KVM, so Andrea's previous KVM
> > specific fix to this code is not applicable.  It also still occurs on
> > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > yet.  I'm able to reproduce this fairly quickly with the above test,
> > but it's not hard to imagine a test w/o any iommu dependencies which
> > simply does a user directed get_user_pages_fast() on a set of userspace
> > addresses, retains the reference, and at some point later rechecks that
> > a new get_user_pages_fast() results in the same page address.  It
> > appears that any sort of device assignment, either vfio or legacy kvm,
> > should be susceptible to this issue and therefore unsafe to use on v4.5+
> > kernels without using explicit hugepages or disabling THP.  Thanks,  
> 
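For anyone wanting to watch the physical frame behind a userspace address
without any iommu involved, here is a minimal sketch in C.  It is not the
get_user_pages_fast() test described above: userspace cannot take that page
reference itself, so without a pin the frame may legitimately change.  It
only reads /proc/self/pagemap, using the entry layout documented for that
file (bit 63 = present, bits 0-54 = PFN).  The helper names are made up for
illustration, and on recent kernels the PFN field reads as zero without
CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Entry layout per the kernel's pagemap documentation. */
#define PAGEMAP_PRESENT  (1ULL << 63)          /* page is present in RAM */
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)    /* bits 0-54 hold the PFN */

/* Returns 1 if the entry describes a page present in RAM, else 0. */
static int pagemap_present(uint64_t entry)
{
    return (entry & PAGEMAP_PRESENT) != 0;
}

/* Returns the PFN of a present page, or 0 if the page is not present. */
static uint64_t pagemap_pfn(uint64_t entry)
{
    return pagemap_present(entry) ? (entry & PAGEMAP_PFN_MASK) : 0;
}

/*
 * Reads the 64-bit pagemap entry covering vaddr in the calling process.
 * Returns 0 on success, -1 on any failure (open, seek or short read).
 */
static int pagemap_read(const void *vaddr, uint64_t *entry)
{
    long page_size = sysconf(_SC_PAGESIZE);
    off_t off;
    ssize_t n;
    int fd;

    assert(page_size > 0);

    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    off = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) *
          (off_t)sizeof(*entry);
    if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }

    n = read(fd, entry, sizeof(*entry));
    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

Calling pagemap_read() on the same address before and after khugepaged has
had a chance to run, and comparing pagemap_pfn() of the two entries, shows
whether the backing frame moved.
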
> I'm not able to reproduce it so far. How long does it usually take?

Generally within the first line of output from the test program.
 
> How much memory does your system have? Could you share your kernel config?

24G, dual-socket Ivy Bridge EP.

Config:
https://paste.fedoraproject.org/360803/14618689/
 
> I've modified your instrumentation slightly to provide more info.
> Could you try this:

Thanks!  Results in:

[   83.429809] page:ffffea0010e57fc0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.439696] flags: 0x2fffff80000000()
[   83.443408] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.454001] flags: 0x2fffff80044048(uptodate|active|head|swapbacked)
[   83.460456] page:ffffea0018a67fc0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.470298] flags: 0x6fffff80000000()
[   83.473973] page dumped because: 1
[   83.477412] page:ffffea0010e57f80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.487283] flags: 0x2fffff80000000()
[   83.490969] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.501502] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.508915] page:ffffea0018a67f80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.518758] flags: 0x6fffff80000000()
[   83.522443] page dumped because: 1
[   83.525874] page:ffffea0010e57f40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.535737] flags: 0x2fffff80000000()
[   83.539434] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.549979] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.557412] page:ffffea0018a67f40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.567260] flags: 0x6fffff80000000()
[   83.570943] page dumped because: 1
[   83.574366] page:ffffea0010e57f00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.584211] flags: 0x2fffff80000000()
[   83.587878] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.598413] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.605862] page:ffffea0018a67f00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.615722] flags: 0x6fffff80000000()
[   83.619399] page dumped because: 1
[   83.622835] page:ffffea0010e57ec0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.632673] flags: 0x2fffff80000000()
[   83.636363] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.646893] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.654302] page:ffffea0018a67ec0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.664150] flags: 0x6fffff80000000()
[   83.667840] page dumped because: 1
[   83.671255] page:ffffea0010e57e80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.681108] flags: 0x2fffff80000000()
[   83.684783] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.695335] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.702773] page:ffffea0018a67e80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.712640] flags: 0x6fffff80000000()
[   83.716335] page dumped because: 1
[   83.719746] page:ffffea0010e57e40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.729591] flags: 0x2fffff80000000()
[   83.733279] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.743843] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.751268] page:ffffea0018a67e40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.761108] flags: 0x6fffff80000000()
[   83.764784] page dumped because: 1
[   83.768206] page:ffffea0010e57e00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.778076] flags: 0x2fffff80000000()
[   83.781754] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.792283] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.799712] page:ffffea0018a67e00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.809559] flags: 0x6fffff80000000()
[   83.813257] page dumped because: 1
[   83.816722] page:ffffea0010e57dc0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.826605] flags: 0x2fffff80000000()
[   83.830285] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.840877] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.848321] page:ffffea0018a67dc0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.858214] flags: 0x6fffff80000000()
[   83.861899] page dumped because: 1
[   83.865355] page:ffffea0010e57d80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.875246] flags: 0x2fffff80000000()
[   83.878930] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.889525] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.896970] page:ffffea0018a67d80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.906883] flags: 0x6fffff80000000()
[   83.910563] page dumped because: 1
[   83.914018] page:ffffea0010e57d40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.923862] flags: 0x2fffff80000000()
[   83.927540] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.938079] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.945493] page:ffffea0018a67d40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.955341] flags: 0x6fffff80000000()
[   83.959022] page dumped because: 1
[   83.962446] page:ffffea0010e57d00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.972296] flags: 0x2fffff80000000()
[   83.975980] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.986516] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.993932] page:ffffea0018a67d00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   84.003778] flags: 0x6fffff80000000()
[   84.007456] page dumped because: 1
...

As you can see by the kernel timestamp, this happened almost
immediately for me.  Thanks for taking a look at this,
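
The struct page addresses in the dump can be turned back into page frame
numbers with a little arithmetic.  A minimal sketch, assuming the x86_64
vmemmap starts at 0xffffea0000000000 and sizeof(struct page) is 64 bytes
(both are assumptions about this kernel build, and the function names are
hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define VMEMMAP_BASE     0xffffea0000000000ULL /* assumed x86_64 vmemmap start */
#define STRUCT_PAGE_SIZE 64ULL                 /* assumed sizeof(struct page) */
#define PAGE_SHIFT_4K    12                    /* 4K base pages */

/* Convert a struct page address, as printed in the dump, to its PFN. */
static uint64_t page_to_pfn_sketch(uint64_t page_addr)
{
    assert(page_addr >= VMEMMAP_BASE);
    return (page_addr - VMEMMAP_BASE) / STRUCT_PAGE_SIZE;
}

/* Convert a PFN to the physical address of the start of that page. */
static uint64_t pfn_to_phys_sketch(uint64_t pfn)
{
    return pfn << PAGE_SHIFT_4K;
}

/* Index of a page within the compound page that starts at head_addr. */
static uint64_t subpage_index(uint64_t page_addr, uint64_t head_addr)
{
    return page_to_pfn_sketch(page_addr) - page_to_pfn_sketch(head_addr);
}
```

Under those assumptions, ffffea0010e57fc0 is PFN 0x4395ff, which is subpage
511 of the compound page whose head is ffffea0010e50000 (PFN 0x439400).
Each following group steps down by one struct page (0x40), so subpages 510,
509, and so on.  ffffea0018a67fc0 works out to PFN 0x6299ff, a different
physical frame at the same offset within its 2MB-aligned region.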

Alex
