On Tue, Oct 10, 2023 at 04:30:18PM -0700, Sean Christopherson wrote: > On Tue, Oct 10, 2023, Al Viro wrote: > > On Mon, Oct 09, 2023 at 05:27:04PM -0700, Sean Christopherson wrote: > > > > > If the last reference is effectively held by guest_memfd, it would be: > > > > > > kvm_gmem_release(), a.k.a. file_operations.release() > > > | > > > -> kvm_put_kvm() > > > | > > > -> kvm_destroy_vm() > > > | > > > -> module_put(kvm_chardev_ops.owner); > > > > ... and now your thread gets preempted and loses CPU; before you get > > it back, some joker calls delete_module(), and page of code containing > > kvm_gmem_release() is unmapped. Even though an address within that > > page is stored as return address in a frame on your thread's stack. > > That thread gets the timeslice again and proceeds to return into > > unmapped page. Oops... > > *sigh* > > What an absolute snafu. Sorry for the wall of text. Feel free to stop reading > after the "Back to KVM..." part below, it's just gory details on how we screwed > things up in KVM. > > But one question before I dive into the KVM mess: other than error paths, how is > module_put(THIS_MODULE) ever safe without being superfluous? I don't see how a > module can put the last reference to itself without either hitting the above > scenario, or without creating deadlock. Something other than the module must put I think module_get/put(THIS_MODULE) is not responsible for guarding this scenario. It just makes a section against module removal. Out of this section, you still need other mechanisms to ensure no task run on the module code before its module_exit() return. This is still true even if no module_get/put(THIS_MODULE) is used. Today with all your help, I can see the fops.owner = THIS_MODULE is an efficient way to ensure no entry to file callbacks when module_exit() is called. > the last reference, no? > > The only exceptions I see are: > > 1. if module_put() is called via async_run_entry_fn(), as delete_module() invokes > async_synchronize_full() before unmapping the module. But IIUC, the async > framework uses workqueues, not the other way around. I.e. delete_module() > doesn't flush arbitrary workqueues. > > 2. if module_put() is called via module_put_and_kthread_exit(), which uses > THIS_MODULE but does module_put() from a core kernel helper and never returns > to the module's kthread, i.e. doesn't return to module code. > > But then this > > $ git grep -E "module_put\(THIS_MODULE" | wc -l > 132 > > make me go "huh"? I've blamed a handful of those calls, and I can't find a single > one that provides any clue as to why the module gets/puts references to itself, > let alone actually justifies the usage. > > E.g. drivers/block/loop.c has this gem > > /* This is safe: open() is still holding a reference. */ > module_put(THIS_MODULE); > > in __loop_clr_fd(), which is invoked from a .release() function. So open() quite > clearly doesn't hold a reference, unless the comment is talking about the reference > that was obtained by the core file systems layer and won't be put until after > .release() completes. But then what on earth is the point of doing > module_get(THIS_MODULE) and module_put(THIS_MODULE)? I see the comment that .release()->__loop_clr_fd() is only called when in "autoclear mode". Maybe in some "manual mode", user should explicitly call IOCTL(LOOP_CLR_FD) to allow module removal. That means in some case, user called IOCTL(LOOP_CONFIGURE) and then closed all fds, but the device state was not cleaned up, so that the driver module is not allowed to be removed. Thanks, Yilun > > > Back to KVM... > > Commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") *tried* > to fix a bug where KVM-the-module could be unloaded while a KVM workqueue callback > was still in-flight. The callback had a reference to the VM, but not to the VM's > file representation. > > After that commit went in, I suggested dropping the use of .owner for VMs and > vCPUs (each of which is represented by an anon inode file) because keeping the VM > alive would pin KVM-the-module until all VMs went away. But I missed the > obvious-in-hindsight issue Al highlighted above. > > Fixing that particular wart is relatively easy: revert commit 70375c2d8fa3 ("Revert > "KVM: set owner of cpu and vm file operations""), and give all of the other KVM-owned > file_operations structures the same treatment by setting .owner correctly. Note, > "correctly" isn't THIS_MODULE in most cases, which is why the code existing is a > bit odd. For most file_operations, on x86 and PPC (and MIPS?), the effective owner > is actually a sub-module, e.g. THIS_MODULE will point at kvm.ko, but on x86 the > effective owner is either kvm-intel.ko or kvm-amd.ko (which holds a reference to > kvm.ko). > > After staring and fiddling for most of today, I finally discovered that grabbing > a reference to the module on behalf of the work item didn't fix the actual bugs, > plural, it just shuffled the deck chairs on the Titanic. And as above, it set us > up to make even bigger mistakes regarding .owner :-( > > The problematic code is effectively kvm_clear_async_pf_completion_queue(). That > helper is called for each vCPU when a VM is being destroyed, i.e. when the last > reference to a VM is put via kvm_put_kvm(). Clearing the queue *should* also > flush all work items, except it doesn't when the work is "done", where "done" just > means the page being faulted in is ready. Using file_operations.owner doesn't > solve anything, e.g. even if async_pf_execute() were gifted a reference to the > VM's file and used the deferred fput(), the same preemption issue exists, it's > just slightly harder to hit. > > The original async #PF code appears to have fudged around the lack of flushing by > gifting a VM reference to the async_pf_execute(). Or maybe it was the other way > around and not flushing was a workaround for the deadlock that occurs if > kvm_clear_async_pf_completion_queue() does flush the workqueue. If kvm_put_kvm() > is called from async_pf_execute() and kvm_put_kvm() flushes the async #PF workqueue, > deadlock occurs becase async_pf_execute() can't return until kvm_put_kvm() finishes, > and kvm_put_kvm() can't return until async_pf_execute() finishes. > > WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm] > Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass > CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > Workqueue: events async_pf_execute [kvm] > RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm] > Call Trace: > <TASK> > async_pf_execute+0x198/0x260 [kvm] > process_one_work+0x145/0x2d0 > worker_thread+0x27e/0x3a0 > kthread+0xba/0xe0 > ret_from_fork+0x2d/0x50 > ret_from_fork_asm+0x11/0x20 > </TASK> > ---[ end trace 0000000000000000 ]--- > INFO: task kworker/8:1:251 blocked for more than 120 seconds. > Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000 > Workqueue: events async_pf_execute [kvm] > Call Trace: > <TASK> > __schedule+0x33f/0xa40 > schedule+0x53/0xc0 > schedule_timeout+0x12a/0x140 > __wait_for_common+0x8d/0x1d0 > __flush_work.isra.0+0x19f/0x2c0 > kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm] > kvm_arch_destroy_vm+0x78/0x1b0 [kvm] > kvm_put_kvm+0x1c1/0x320 [kvm] > async_pf_execute+0x198/0x260 [kvm] > process_one_work+0x145/0x2d0 > worker_thread+0x27e/0x3a0 > kthread+0xba/0xe0 > ret_from_fork+0x2d/0x50 > ret_from_fork_asm+0x11/0x20 > </TASK> > > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue, then > there's no need to gift async_pf_execute() a reference because all invocations > of async_pf_execute() will be forced to complete before the vCPU and its VM are > destroyed/freed. And that also fixes the module unloading mess because __fput() > won't do module_put() on the last vCPU reference until the vCPU has been freed. > > The attached patches are lightly tested, but I think they fix the KVM mess. I > likely won't post a proper series until next week, I'm going to be offline the > next two days. > From 017fedee5608094f2e5535297443db7512a213b8 Mon Sep 17 00:00:00 2001 > From: Sean Christopherson <seanjc@xxxxxxxxxx> > Date: Tue, 10 Oct 2023 11:42:32 -0700 > Subject: [PATCH 1/3] KVM: Set file_operations.owner appropriately for all such > structures > > This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and gives > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> > --- > arch/x86/kvm/debugfs.c | 1 + > virt/kvm/kvm_main.c | 11 ++++++++--- > 2 files changed, 9 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c > index ee8c4c3496ed..eea6ea7f14af 100644 > --- a/arch/x86/kvm/debugfs.c > +++ b/arch/x86/kvm/debugfs.c > @@ -182,6 +182,7 @@ static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file) > } > > static const struct file_operations mmu_rmaps_stat_fops = { > + .owner = THIS_MODULE, > .open = kvm_mmu_rmaps_stat_open, > .read = seq_read, > .llseek = seq_lseek, > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 486800a7024b..1e65a506985f 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -3887,7 +3887,7 @@ static int kvm_vcpu_release(struct inode *inode, struct file *filp) > return 0; > } > > -static const struct file_operations kvm_vcpu_fops = { > +static struct file_operations kvm_vcpu_fops = { > .release = kvm_vcpu_release, > .unlocked_ioctl = kvm_vcpu_ioctl, > .mmap = kvm_vcpu_mmap, > @@ -4081,6 +4081,7 @@ static int kvm_vcpu_stats_release(struct inode *inode, struct file *file) > } > > static const struct file_operations kvm_vcpu_stats_fops = { > + .owner = THIS_MODULE, > .read = kvm_vcpu_stats_read, > .release = kvm_vcpu_stats_release, > .llseek = noop_llseek, > @@ -4431,7 +4432,7 @@ static int kvm_device_release(struct inode *inode, struct file *filp) > return 0; > } > > -static const struct file_operations kvm_device_fops = { > +static struct file_operations kvm_device_fops = { > .unlocked_ioctl = kvm_device_ioctl, > .release = kvm_device_release, > KVM_COMPAT(kvm_device_ioctl), > @@ -4759,6 +4760,7 @@ static int kvm_vm_stats_release(struct inode *inode, struct file *file) > } > > static const struct file_operations kvm_vm_stats_fops = { > + .owner = THIS_MODULE, > .read = kvm_vm_stats_read, > .release = kvm_vm_stats_release, > .llseek = noop_llseek, > @@ -5060,7 +5062,7 @@ static long kvm_vm_compat_ioctl(struct file *filp, > } > #endif > > -static const struct file_operations kvm_vm_fops = { > +static struct file_operations kvm_vm_fops = { > .release = kvm_vm_release, > .unlocked_ioctl = kvm_vm_ioctl, > .llseek = noop_llseek, > @@ -6095,6 +6097,9 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) > goto err_async_pf; > > kvm_chardev_ops.owner = module; > + kvm_vm_fops.owner = module; > + kvm_vcpu_fops.owner = module; > + kvm_device_fops.owner = module; > > kvm_preempt_ops.sched_in = kvm_sched_in; > kvm_preempt_ops.sched_out = kvm_sched_out; > > base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12 > -- > 2.42.0.609.gbb76f46606-goog > > From f5be42f3be9967a0591051a7c8d73cac2c0a072b Mon Sep 17 00:00:00 2001 > From: Sean Christopherson <seanjc@xxxxxxxxxx> > Date: Tue, 10 Oct 2023 13:42:13 -0700 > Subject: [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being > destroyed > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> > --- > virt/kvm/async_pf.c | 15 ++++++++++++--- > 1 file changed, 12 insertions(+), 3 deletions(-) > > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c > index e033c79d528e..7aeb9d1f43b1 100644 > --- a/virt/kvm/async_pf.c > +++ b/virt/kvm/async_pf.c > @@ -87,7 +87,6 @@ static void async_pf_execute(struct work_struct *work) > __kvm_vcpu_wake_up(vcpu); > > mmput(mm); > - kvm_put_kvm(vcpu->kvm); > } > > void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu) > @@ -114,7 +113,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu) > #else > if (cancel_work_sync(&work->work)) { > mmput(work->mm); > - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */ > kmem_cache_free(async_pf_cache, work); > } > #endif > @@ -126,7 +124,19 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu) > list_first_entry(&vcpu->async_pf.done, > typeof(*work), link); > list_del(&work->link); > + > + spin_unlock(&vcpu->async_pf.lock); > + > + /* > + * The async #PF is "done", but KVM must wait for the work item > + * itself, i.e. async_pf_execute(), to run to completion. If > + * KVM is a module, KVM must ensure *no* code owned by the KVM > + * (the module) can be run after the last call to module_put(), > + * i.e. after the last reference to the last vCPU's file is put. > + */ > + flush_work(&work->work); > kmem_cache_free(async_pf_cache, work); > + spin_lock(&vcpu->async_pf.lock); > } > spin_unlock(&vcpu->async_pf.lock); > > @@ -186,7 +196,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > work->arch = *arch; > work->mm = current->mm; > mmget(work->mm); > - kvm_get_kvm(work->vcpu->kvm); > > INIT_WORK(&work->work, async_pf_execute); > > -- > 2.42.0.609.gbb76f46606-goog > > From 0a4238f027e41c64afa2919440420ea56c0cae80 Mon Sep 17 00:00:00 2001 > From: Sean Christopherson <seanjc@xxxxxxxxxx> > Date: Tue, 10 Oct 2023 15:09:43 -0700 > Subject: [PATCH 3/3] Revert "KVM: Prevent module exit until all VMs are freed" > > Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that > was actually due to failure to flush a workqueue, not a lack of module > refcounting. > > blah blah blah > > This reverts commit 405294f29faee5de8c10cb9d4a90e229c2835279 and commit > commit 5f6de5cbebee925a612856fce6f9182bb3eee0db. > > Fixes: 405294f29fae ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM") > Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> > --- > virt/kvm/kvm_main.c | 7 ------- > 1 file changed, 7 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 1e65a506985f..3b1b9e8dd70c 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -115,8 +115,6 @@ EXPORT_SYMBOL_GPL(kvm_debugfs_dir); > > static const struct file_operations stat_fops_per_vm; > > -static struct file_operations kvm_chardev_ops; > - > static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl, > unsigned long arg); > #ifdef CONFIG_KVM_COMPAT > @@ -1157,9 +1155,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > if (!kvm) > return ERR_PTR(-ENOMEM); > > - /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */ > - __module_get(kvm_chardev_ops.owner); > - > KVM_MMU_LOCK_INIT(kvm); > mmgrab(current->mm); > kvm->mm = current->mm; > @@ -1279,7 +1274,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > out_err_no_srcu: > kvm_arch_free_vm(kvm); > mmdrop(current->mm); > - module_put(kvm_chardev_ops.owner); > return ERR_PTR(r); > } > > @@ -1348,7 +1342,6 @@ static void kvm_destroy_vm(struct kvm *kvm) > preempt_notifier_dec(); > hardware_disable_all(); > mmdrop(mm); > - module_put(kvm_chardev_ops.owner); > } > > void kvm_get_kvm(struct kvm *kvm) > -- > 2.42.0.609.gbb76f46606-goog >