I'm not that familiar with the kernel's workqueues, but this seems like
the classic "callback outlives the memory it references" use-after-free,
where the process_srcu callback is outliving struct kvm (which contains
the srcu_struct). If that's right, then calling srcu_barrier (which
should wait for all of the call_srcu callbacks to complete, and those
are what enqueue the process_srcu callbacks) before cleanup_srcu_struct
in kvm_destroy_vm probably fixes this. The corresponding patch to
virt/kvm/kvm_main.c looks something like:

static void kvm_destroy_vm(struct kvm *kvm)
{
	...
	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
		kvm_free_memslots(kvm, kvm->memslots[i]);
+	srcu_barrier(&kvm->irq_srcu);
	cleanup_srcu_struct(&kvm->irq_srcu);
+	srcu_barrier(&kvm->srcu);
	cleanup_srcu_struct(&kvm->srcu);
	...

Since we don't have a repro, this obviously won't be readily testable.
I find srcu subtle enough that I don't fully trust my reasoning (in
particular, I don't trust that waiting for all of the call_srcu
callbacks to complete also waits for all of the process_srcu callbacks).
Does someone else know whether that's the case?
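To make the ordering I have in mind concrete, here is a minimal,
self-contained sketch of the same pattern outside of kvm. Everything
below (struct foo, foo_create, foo_retire_payload, foo_destroy) is
hypothetical driver-style code made up for illustration, not kvm code,
and it assumes my reading of srcu_barrier is correct, i.e. that it
waits for all previously queued call_srcu callbacks before returning:

#include <linux/slab.h>
#include <linux/srcu.h>

/* Hypothetical object standing in for struct kvm: it embeds the
 * srcu_struct that pending SRCU callbacks still reference. */
struct foo {
	struct srcu_struct srcu;
	struct rcu_head rcu;
	void *payload;
};

static struct foo *foo_create(void)
{
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;
	if (init_srcu_struct(&f->srcu)) {
		kfree(f);
		return NULL;
	}
	return f;
}

/* Runs from SRCU callback context, possibly long after call_srcu(). */
static void foo_payload_reclaim(struct rcu_head *head)
{
	struct foo *f = container_of(head, struct foo, rcu);

	kfree(f->payload);
	f->payload = NULL;
}

/* Defer freeing the payload until current SRCU readers are done. */
static void foo_retire_payload(struct foo *f)
{
	call_srcu(&f->srcu, &f->rcu, foo_payload_reclaim);
}

static void foo_destroy(struct foo *f)
{
	/*
	 * Wait for every callback queued via call_srcu() on f->srcu to
	 * run before tearing the srcu_struct down.  Without this, SRCU's
	 * deferred processing can still be touching f->srcu after
	 * cleanup_srcu_struct() and kfree() below, which is the same
	 * shape of use-after-free that KASAN is reporting in process_srcu().
	 */
	srcu_barrier(&f->srcu);
	cleanup_srcu_struct(&f->srcu);
	kfree(f);
}

The open question is the same one as above: if srcu_barrier only waits
for the callbacks themselves and not for the process_srcu work that
invokes them, the window would still exist, and that is the part I'd
like someone more familiar with SRCU to confirm.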
Steve

On Sun, Dec 11, 2016 at 12:49 AM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> On Sun, Dec 11, 2016 at 9:40 AM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
>> On 11 December 2016 at 07:46, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>>> Hello,
>>>
>>> I am getting the following use-after-free reports while running the
>>> syzkaller fuzzer.
>>> On commit 318c8932ddec5c1c26a4af0f3c053784841c598e (Dec 7).
>>> Unfortunately it is not reproducible, but all reports look sane and
>>> very similar, so I would assume that it is some hard-to-trigger race.
>>> In all cases the use-after-free offset within struct kvm is 344 bytes.
>>> This points to the srcu field, which starts at 208 with size 360 (I
>>> have some debug configs enabled).
>> [...]
>>> [ 376.024345] [<ffffffff81a77f7e>] __fput+0x34e/0x910 fs/file_table.c:208
>>> [ 376.024345] [<ffffffff81a785ca>] ____fput+0x1a/0x20 fs/file_table.c:244
>>
>> I've been hitting what I think is a struct file refcounting bug which
>> causes symptoms similar to what you have here (the struct file is
>> freed while somebody still has an active reference to it).
>>
>>> [ 376.024345] [<ffffffff81483c20>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
>>> [ 376.024345] [< inline >] exit_task_work include/linux/task_work.h:21
>>> [ 376.024345] [<ffffffff814129e2>] do_exit+0x1842/0x2650 kernel/exit.c:828
>>> [ 376.024345] [<ffffffff814139ae>] do_group_exit+0x14e/0x420 kernel/exit.c:932
>>> [ 376.024345] [<ffffffff81442b43>] get_signal+0x663/0x1880 kernel/signal.c:2307
>>> [ 376.024345] [<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
>>
>> Was this or any other process by any chance killed by the OOM killer?
>> That seems to be a pattern in the crashes I've seen. If not, do you
>> know what killed this process?
>
> Difficult to say as I can't reproduce them.
> I've looked at the logs I have and there are no OOM kills, only some
> kvm-related messages:
>
> [ 372.188708] kvm [12528]: vcpu0, guest rIP: 0xfff0 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x2, nop
> [ 372.321334] kvm [12528]: vcpu0, guest rIP: 0xfff0 unhandled wrmsr: 0x0 data 0x0
> [ 372.426831] kvm [12593]: vcpu512, guest rIP: 0xfff0 unhandled wrmsr: 0x5 data 0x200
> [ 372.646417] irq bypass consumer (token ffff880052f74780) registration fails: -16
> [ 373.001273] pit: kvm: requested 1676 ns i8254 timer period limited to 500000 ns
> [ 375.541449] kvm [13011]: vcpu0, guest rIP: 0x110000 unhandled wrmsr: 0x0 data 0x2
> [ 376.005387] ==================================================================
> [ 376.024345] BUG: KASAN: use-after-free in process_srcu+0x27a/0x280 at addr ffff88005e29a418
>
> [ 720.214985] kvm: vcpu 0: requested 244148 ns lapic timer period limited to 500000 ns
> [ 720.271334] kvm: vcpu 0: requested 244148 ns lapic timer period limited to 500000 ns
> [ 720.567985] kvm_vm_ioctl_assign_device: host device not found
> [ 721.094589] kvm [22114]: vcpu0, guest rIP: 0x2 unhandled wrmsr: 0x6 data 0x8
> [ 723.829467] ==================================================================
> [ 723.829467] BUG: KASAN: use-after-free in process_srcu+0x27a/0x280 at addr ffff88005a4d10d8
>
> The logs capture roughly 3-4 seconds before each crash.
> However, syzkaller test processes tend to consume lots of memory from
> time to time and cause low-memory conditions.
>
> Kills are usually caused by my test driver, which kills test processes
> after a short time.
>
> However, I do see other assorted bugs caused by kvm that are induced
> by OOM kills:
> https://groups.google.com/d/msg/syzkaller/ytVPh93HLnI/KhZdengZBwAJ
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html