On Tue, Jul 25, 2023 at 04:05:50PM -0300, Jason Gunthorpe wrote:
> Even though the test suite covers this it somehow became obscured that
> this wasn't working.
>
> The test iommufd_ioas.mock_domain.access_domain_destory would blow up
> rarely.
>
> end should be set to 1 because this just pushed an item, the carry, to the
> pfns list.
>
> Sometimes the test would blow up with:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP
> CPU: 5 PID: 584 Comm: iommufd Not tainted 6.5.0-rc1-dirty #1236
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> RIP: 0010:batch_unpin+0xa2/0x100 [iommufd]
> Code: 17 48 81 fe ff ff 07 00 77 70 48 8b 15 b7 be 97 e2 48 85 d2 74 14 48 8b 14 fa 48 85 d2 74 0b 40 0f b6 f6 48 c1 e6 04 48 01 f2 <48> 8b 3a 48 c1 e0 06 89 ca 48 89 de 48 83 e7 f0 48 01 c7 e8 96 dc
> RSP: 0018:ffffc90001677a58 EFLAGS: 00010246
> RAX: 00007f7e2646f000 RBX: 0000000000000000 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: 00000000fefc4c8d RDI: 0000000000fefc4c
> RBP: ffffc90001677a80 R08: 0000000000000048 R09: 0000000000000200
> R10: 0000000000030b98 R11: ffffffff81f3bb40 R12: 0000000000000001
> R13: ffff888101f75800 R14: ffffc90001677ad0 R15: 00000000000001fe
> FS: 00007f9323679740(0000) GS:ffff8881ba540000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000105ede003 CR4: 00000000003706a0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> ? show_regs+0x5c/0x70
> ? __die+0x1f/0x60
> ? page_fault_oops+0x15d/0x440
> ? lock_release+0xbc/0x240
> ? exc_page_fault+0x4a4/0x970
> ? asm_exc_page_fault+0x27/0x30
> ? batch_unpin+0xa2/0x100 [iommufd]
> ? batch_unpin+0xba/0x100 [iommufd]
> __iopt_area_unfill_domain+0x198/0x430 [iommufd]
> ? __mutex_lock+0x8c/0xb80
> ? __mutex_lock+0x6aa/0xb80
> ? xa_erase+0x28/0x30
> ? iopt_table_remove_domain+0x162/0x320 [iommufd]
> ? lock_release+0xbc/0x240
> iopt_area_unfill_domain+0xd/0x10 [iommufd]
> iopt_table_remove_domain+0x195/0x320 [iommufd]
> iommufd_hw_pagetable_destroy+0xb3/0x110 [iommufd]
> iommufd_object_destroy_user+0x8e/0xf0 [iommufd]
> iommufd_device_detach+0xc5/0x140 [iommufd]
> iommufd_selftest_destroy+0x1f/0x70 [iommufd]
> iommufd_object_destroy_user+0x8e/0xf0 [iommufd]
> iommufd_destroy+0x3a/0x50 [iommufd]
> iommufd_fops_ioctl+0xfb/0x170 [iommufd]
> __x64_sys_ioctl+0x40d/0x9a0
> do_syscall_64+0x3c/0x80
> entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Fixes: f394576eb11d ("iommufd: PFN handling for iopt_pages")
> Reported-by: Nicolin Chen <nicolinc@xxxxxxxxxx>
> Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>

This fixes the memory leak with HugePages, and likely the rarely
triggered BUG too, since I can no longer reproduce it after applying
this patch.

Tested-by: Nicolin Chen <nicolinc@xxxxxxxxxx>

Thanks!