Re: [RFC] UAF in acrn_irqfd_assign() and vfio_virqfd_enable()

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Wed, 12 Jun 2024 03:16:53 +0100

On Tue, Jun 11, 2024 at 05:04:38PM -0600, Alex Williamson wrote:

> > OK, so the memory safety in there depends upon
> > 	* external exclusion wrt vfio_virqfd_disable() on caller-specific
> > locks (vfio_pci_core_device::ioeventfds_lock for vfio_pci_rdwr.c,
> > vfio_pci_core_device::igate for the rest?  What about the path via
> > vfio_pci_core_disable()?)
> 
> This is only called when the device is closed, therefore there's no
> userspace access to generate a race.

Umm...  Let's see if I got confused in RTFS:

	1.  Calls of vfio_pci_core_disable() come from assorted ->close_device()
instances and failure exits in ->open_device() ones.
	2.  ->open_device() is called by vfio_df_device_first_open() from
vfio_df_open().  That's done under device->dev_set->lock.
	3.  ->close_device() is called by vfio_df_device_last_close() from
vfio_df_close(), under the same lock.
	4.  vfio_df_open() comes from vfio_df_ioctl_bind_iommufd() or from
vfio_df_group_open().  vfio_df_close() is done by the failure exits in those
two, as well as by vfio_df_unbind_iommufd() and vfio_df_group_close().
	5.  vfio_df_bind_iommufd() handles VFIO_DEVICE_BIND_IOMMUFD in
vfio_device_fops->unlocked_ioctl(); only works for !df->group and only
once, unless I'm misreading vfio_df_open().  No other ioctls are allowed
until that's done and vfio_df_unbind_iommufd() is done in ->release(),
in case of !df->group.  vfio_df_close() is done there, in case we had
a successful BIND_IOMMUFD done at some point.  Multiple files can be opened
for the same device; once one of them have done BIND_IOMMUFD, BIND_IOMMUFD
on any of them will fail until the first caller gets closed.  Once that
happens, others can get BIND_IOMMUFD; until then ioctls don't work for
them at all (IOW, BIND_IOMMUFD grants the ability to do ioctls only for
the opened file it had been done on).
	6.  vfio_df_group_open() and vfio_df_close() is about the other
way to get files with such ->f_op - VFIO_GROUP_GET_DEVICE_FD handling
in vfio_group_fops.unlocked_ioctl().  That gets an anon-inode file
with vfio_device_fops and shoves it into descriptor table.  Presumably
vfio_device_get_from_name() always returns a device with non-NULL
device->group (it would better, or the things would get really confused).
vfio_df_group_open() is done fist, with vfio_df_group_close() on failure
exit *and* in ->release() of those suckers (again, assuming we do get
non-NULL ->group).

	OK, that seems to be enough - anything done in ->ioctl() would
be completed between vfio_df_open() and vfio_df_close(), so we do have
the exclusion.	Is the above correct?

FWIW, this was a confusing bit:
        vfio_pci_core_disable(vdev);

        mutex_lock(&vdev->igate);
        if (vdev->err_trigger) {
                eventfd_ctx_put(vdev->err_trigger);
                vdev->err_trigger = NULL;
        }
        if (vdev->req_trigger) {
                eventfd_ctx_put(vdev->req_trigger);
                vdev->req_trigger = NULL;
        }
        mutex_unlock(&vdev->igate);
in vfio_pci_core_close_device().  Since we need an exclusion on ->igate
there for something, and since it's one of the locks used to serialize
vfio_virqfd_enable()...

> > 	* no EPOLLHUP on eventfd while the file is pinned.  That's what
> >         /*
> >          * Do not drop the file until the irqfd is fully initialized,
> >          * otherwise we might race against the EPOLLHUP.
> >          */
> > in there (that "irqfd" is a typo for "kirqfd", right?) refers to.
> 
> Sorry, I'm not fully grasping your comment.  "irqfd" is not a typo
> here, "kirqfd" seems to be a Xen thing.  I believe the comment is
> referring to holding a reference to the fd until everything is in place
> to cleanup correctly if the user process is killed mid-setup.  Thanks,

*blink*

s/kirqfd/virqfd/, sorry.  In the comment earlier in the same function:
         * virqfds can be released by closing the eventfd or directly
	   ^^^^^^^
         * through ioctl.  These are both done through a workqueue, so
         * we update the pointer to the virqfd under lock to avoid
				    ^^^ ^^^^^^
         * pushing multiple jobs to release the same virqfd.
					    ^^^ ^^^^ ^^^^^^
In the second comment you have
         * Do not drop the file until the irqfd is fully initialized,
				      ^^^ ^^^^^
         * otherwise we might race against the EPOLLHUP.

	If that refers to the same object, the comment makes sense - once
you've called vfs_poll(), EPOLLHUP wakeup would have your new instance of
struct virqfd freed, so accessing it (e.g. in
                        schedule_work(&virqfd->inject);
) is only safe because EPOLLHUP won't come until eventfd_release(), which
will not happen as long as you don't drop the file reference that sits in
irqfd.file.  That's the reason why struct fd instance can't be released
until you are done with setting the struct virqfd instance up.

	If that reading is what you intended, "irqfd" in the second
comment ought to be "virqfd", to be consistent with the reference to the
same thing in the earlier comment.  If that's not what you meant... what
is that comment really about?  Killing the user process mid-setup won't
actually do anything until your thread is out of whatever syscall it
had been in (ioctl(2), usually); the dangerous scenario would be having
another thread close the same descriptor after you've done fdput().

	The thing you need to avoid is having all references to
eventfd file dropped.  For that the reference in descriptor table must
be gone.  Sure, killing the process might do that - once all threads
get to exit_files() and drop their references to descriptor table.
Then put_files_struct() from the last of them will call close_files()
and drop all file references you had in the table.

	But that can happen only when all threads have gotten through
the signal delivery.  Including the one that was in the middle of
vfio_virqfd_enable().  And _that_ won't happen until return from that
function.

	So having the caller killed mid-setup is not an issue.	Another
thread sharing the same descriptor table and calling close(fd) (or
dup2(something, fd), or anything else that would close that descriptor)
would be.  _That_ is what is prevented by the fdget()/fdput() pair -
between those we are guaranteed that file reference will stay pinned.
If descriptor table is shared, fdget() will clone the reference and store
it in irqfd.file, so the file remains pinned no matter what happens
to descriptor table, until fdput() drops the reference.  If the table
is _not_ shared, the reference in it won't go away until we are done,
so we can borrow that into irqfd.file (and do nothing on fdput()).

	Anyway, I believe that what you have there is actually safe.
Analysis could be less convoluted, but then I might've missed simpler
reasons why everything works.

	It really needs comments in there - as it is, two drivers have
copied that scheme without bothering with exclusion (commit f8941e6c4c71
"xen: privcmd: Add support for irqfd" last year and commit aa3b483ff1d7
"virt: acrn: Introduce irqfd" three years ago) with, AFAICT, real UAF
in each ;-/