BUG unpinning 1 GiB huge pages with KVM PCI assignment

Using KVM PCI assignment with 1 GiB huge pages trips a BUG in 3.12.0-rc7, e.g.

# qemu-system-x86_64 \
	-m 8192 \
	-mem-path /var/lib/hugetlbfs/pagesize-1GB \
	-mem-prealloc \
	-enable-kvm \
	-device pci-assign,host=1:0.0 \
	-drive file=/var/tmp/vm.img,cache=none


[  287.081736] ------------[ cut here ]------------
[  287.086364] kernel BUG at mm/hugetlb.c:654!
[  287.090552] invalid opcode: 0000 [#1] PREEMPT SMP 
[  287.095407] Modules linked in: pci_stub autofs4 sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables binfmt_misc freq_table processor x86_pkg_temp_thermal kvm_intel kvm crc32_pclmul microcode serio_raw i2c_i801 evdev sg igb i2c_algo_bit i2c_core ptp pps_core mlx4_core button ext4 jbd2 mbcache crc16 usbhid sd_mod
[  287.124916] CPU: 15 PID: 25668 Comm: qemu-system-x86 Not tainted 3.12.0-rc7 #1
[  287.132140] Hardware name: DataDirect Networks SFA12KX/SFA12000, BIOS 21.0m4 06/28/2013
[  287.140145] task: ffff88007c732e60 ti: ffff881ff1d3a000 task.ti: ffff881ff1d3a000
[  287.147620] RIP: 0010:[<ffffffff811395e1>]  [<ffffffff811395e1>] free_huge_page+0x1d1/0x1e0
[  287.155992] RSP: 0018:ffff881ff1d3ba88  EFLAGS: 00010213
[  287.161309] RAX: 0000000000000000 RBX: ffffffff818bcd80 RCX: 0000000000000012
[  287.168446] RDX: 020000000000400c RSI: 0000000000001000 RDI: 0000000040000000
[  287.175574] RBP: ffff881ff1d3bab8 R08: 0000000000000000 R09: 0000000000000002
[  287.182705] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea007c000000
[  287.189834] R13: 020000000000400c R14: 0000000000000000 R15: 00000000ffffffff
[  287.196964] FS:  00007f13722d5840(0000) GS:ffff88287f660000(0000) knlGS:0000000000000000
[  287.205048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.210790] CR2: ffffffffff600400 CR3: 0000001fee3f5000 CR4: 00000000001427e0
[  287.217918] Stack:
[  287.219931]  0000000000000001 ffffea007c000000 0000000001f00000 ffff881fe3d88500
[  287.227390]  00000000000e0000 00000000ffffffff ffff881ff1d3bad8 ffffffff81102f9c
[  287.234849]  0000000000000246 ffffea007c000000 ffff881ff1d3baf8 ffffffff811035c0
[  287.242308] Call Trace:
[  287.244762]  [<ffffffff81102f9c>] __put_compound_page+0x1c/0x30
[  287.250680]  [<ffffffff811035c0>] put_compound_page+0x80/0x200
[  287.256516]  [<ffffffff81103d05>] put_page+0x45/0x50
[  287.261487]  [<ffffffffa019f070>] kvm_release_pfn_clean+0x50/0x60 [kvm]
[  287.268098]  [<ffffffffa01a62d5>] kvm_iommu_put_pages+0xb5/0xe0 [kvm]
[  287.274542]  [<ffffffffa01a6315>] kvm_iommu_unmap_pages+0x15/0x20 [kvm]
[  287.281160]  [<ffffffffa01a638a>] kvm_iommu_unmap_memslots+0x6a/0x90 [kvm]
[  287.288038]  [<ffffffffa01a68b7>] kvm_assign_device+0xa7/0x140 [kvm]
[  287.294398]  [<ffffffffa01a5e6c>] kvm_vm_ioctl_assigned_device+0x78c/0xb40 [kvm]
[  287.301795]  [<ffffffff8113baa1>] ? alloc_pages_vma+0xb1/0x1b0
[  287.307632]  [<ffffffffa01a089e>] kvm_vm_ioctl+0x1be/0x5b0 [kvm]
[  287.313645]  [<ffffffff811220fd>] ? remove_vma+0x5d/0x70
[  287.318963]  [<ffffffff8103ecec>] ? __do_page_fault+0x1fc/0x4b0
[  287.324886]  [<ffffffffa01b49ec>] ? kvm_dev_ioctl_check_extension+0x8c/0xd0 [kvm]
[  287.332370]  [<ffffffffa019fba6>] ? kvm_dev_ioctl+0xa6/0x460 [kvm]
[  287.338551]  [<ffffffff8115e049>] do_vfs_ioctl+0x89/0x4c0
[  287.343953]  [<ffffffff8115e521>] SyS_ioctl+0xa1/0xb0
[  287.349007]  [<ffffffff814c1552>] system_call_fastpath+0x16/0x1b
[  287.355011] Code: e6 48 89 df 48 89 42 08 48 89 10 4d 89 54 24 20 4d 89 4c 24 28 e8 70 bc ff ff 48 83 6b 38 01 42 83 6c ab 08 01 eb 91 0f 0b eb fe <0f> 0b eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 
[  287.374986] RIP  [<ffffffff811395e1>] free_huge_page+0x1d1/0x1e0
[  287.381007]  RSP <ffff881ff1d3ba88>
[  287.384508] ---[ end trace 82c719f97df2e524 ]---
[  287.389129] Kernel panic - not syncing: Fatal exception
[  287.394378] ------------[ cut here ]------------


This is on an Ivy Bridge system, so it has an IOMMU with snoop control, hence
the map/unmap/map sequence on device assignment to get the cache coherency
right.  It appears we are unpinning tail pages we never pinned the first time
through kvm_iommu_map_memslots().  This kernel does not have THP enabled, if
that makes a difference.
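
To make that concrete, here is a userspace toy of the asymmetry I'm
imagining (just a model, not the kernel code; the pin-once-per-huge-page
granularity on the map side is an assumption on my part, not something I've
confirmed in virt/kvm/iommu.c):

#include <assert.h>
#include <stdio.h>

#define SUBPAGES (1UL << 18)	/* 4 KiB subpages per 1 GiB huge page */

static long refcount[SUBPAGES];	/* refcount[0] plays the head page */

/* map side: suppose the mapping pins the huge page once, on the head */
static void pin_huge_mapping(void)
{
	refcount[0]++;
}

/* unmap side: a kvm_iommu_put_pages()-style loop, one put per 4 KiB pfn */
static void unpin_range(unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		refcount[i]--;
}

int main(void)
{
	pin_huge_mapping();
	unpin_range(SUBPAGES);

	/* the head balances out, but every tail just dropped below zero */
	printf("head: %ld  tail[1]: %ld\n", refcount[0], refcount[1]);
	assert(refcount[1] >= 0);	/* fires, like free_huge_page()'s BUG_ON */
	return 0;
}

If the map path really does pin at huge page granularity while
kvm_iommu_put_pages() releases at 4 KiB granularity, every tail page eats a
put it never saw a get for, which would line up with the refcount sanity
check in free_huge_page() going off.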

Interestingly, with this patch

  http://www.spinics.net/lists/kvm/msg97561.html

we no longer trip the BUG, but on qemu exit we leak the memory, as the huge
pages never go back into the free pool.  That patch is likely just masking the
original issue.

I haven't had any luck tracking down the bug yet.  Ideas on where to look?

Greg