[RFC] potential UAF in kvm_spapr_tce_attach_iommu_group() (was Re: [PATCH 11/19] switch simple users of fdget() to CLASS(fd, ...))

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 10 Jun 2024 03:44:37 +0100

On Fri, Jun 07, 2024 at 10:08:14PM +0100, Al Viro wrote:

> Hell knows - it feels like mixing __cleanup-based stuff with anything
> explicit leads to massive headache.  And I *really* hate to have
> e.g. inode_unlock() hidden in __cleanup in a random subset of places.
> Unlike dropping file references (if we do that a bit later, nothing
> would really care), the loss of explicit control over the places where
> inode lock is dropped is asking for serious trouble.
> 
> Any suggestions?  Linus, what's your opinion on the use of CLASS...
> stuff?

While looking through the converted fdget() users, some interesting
stuff got caught.  Example:

kvm_device_fops.unlocked_ioctl() is equal to kvm_device_ioctl() and it
gets called (by ioctl(2)) without any locks.

kvm_device_ioctl() calls kvm_device_ioctl_attr(), passing it dev->ops->set_attr.

kvm_device_ioctl_attr() calls the callback passed to it, still without any
locks.

->set_attr() can be kvm_vfio_set_attr(), which calls kvm_vfio_set_file(), which
calls kvm_vfio_file_set_spapr_tce(), which takes dev->private.lock and
calls kvm_spapr_tce_attach_iommu_group().  No kvm->lock held.

Now, in kvm_spapr_tce_attach_iommu_group() we have (in mainline)
        f = fdget(tablefd);
        if (!f.file)
                return -EBADF;

        rcu_read_lock();
        list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
                if (stt == f.file->private_data) {
                        found = true;
                        break;
                }
        }
        rcu_read_unlock();

        fdput(f);

        if (!found)
                return -EINVAL;

	....
        list_add_rcu(&stit->next, &stt->iommu_tables);

What happens if another thread closes the damn descriptor just as we'd
done fdput()?  This:
static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
{
        struct kvmppc_spapr_tce_table *stt = filp->private_data;
        struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
        struct kvm *kvm = stt->kvm;

        mutex_lock(&kvm->lock);
        list_del_rcu(&stt->list);
        mutex_unlock(&kvm->lock);

        list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
                WARN_ON(!kref_read(&stit->kref));
                while (1) {
                        if (kref_put(&stit->kref, kvm_spapr_tce_liobn_put))
                                break;
                }
        }

        account_locked_vm(kvm->mm,
                kvmppc_stt_pages(kvmppc_tce_pages(stt->size)), false);

        kvm_put_kvm(stt->kvm);

        call_rcu(&stt->rcu, release_spapr_tce_table);

        return 0;
}

Leaving aside the question of sanity of that while (!kfref_put()) loop,
that function will *NOT* block on kvm->lock (the only lock being held
by the caller of kvm_spapr_tce_attach_iommu_group() is struct kvm_vfio::lock,
not struct kvm::lock) and it will arrange for RCU-delayed call of
release_spapr_tce_table(), which will kfree stt.

Recall that in kvm_spapr_tce_attach_iommu_group() we are not holding
rcu_read_lock() between fdput() and list_add_rcu() (we couldn't - not
with the blocking allocations we have there), so call_rcu() might as
well have been a direct call.

What's there to protect stt from being freed right after fdput()?

Unless I'm misreading that code (entirely possible), this fdput() shouldn't
be done until we are done with stt.