RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

"Liu, Yi L" <yi.l.liu@xxxxxxxxx> · Tue, 31 Mar 2020 13:22:03 +0000

> From: Tian, Kevin <kevin.tian@xxxxxxxxx>
> Sent: Tuesday, March 31, 2020 1:41 PM
> To: Liu, Yi L <yi.l.liu@xxxxxxxxx>; alex.williamson@xxxxxxxxxx;
> eric.auger@xxxxxxxxxx
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> > From: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> > Sent: Monday, March 30, 2020 10:37 PM
> >
> > > From: Tian, Kevin <kevin.tian@xxxxxxxxx>
> > > Sent: Monday, March 30, 2020 4:32 PM
> > > To: Liu, Yi L <yi.l.liu@xxxxxxxxx>; alex.williamson@xxxxxxxxxx;
> > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > > From: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L <yi.l.liu@xxxxxxxxx>
> > > >
> > > > For a long time, devices have only one DMA address space from
> > > > platform IOMMU's point of view. This is true for both bare metal
> > > > and directed- access in virtualization environment. Reason is the
> > > > source ID of DMA in PCIe are BDF (bus/dev/fnc ID), which results
> > > > in only device granularity
> > >
> > > are->is
> >
> > thanks.
> >
> > > > DMA isolation. However, this is changing with the latest
> > > > advancement in I/O technology area. More and more platform vendors
> > > > are utilizing the
> > PCIe
> > > > PASID TLP prefix in DMA requests, thus to give devices with
> > > > multiple DMA address spaces as identified by their individual
> > > > PASIDs. For example, Shared Virtual Addressing (SVA, a.k.a Shared
> > > > Virtual Memory) is able to let device access multiple process
> > > > virtual address space by binding the
> > >
> > > "address space" -> "address spaces"
> > >
> > > "binding the" -> "binding each"
> >
> > will correct both.
> >
> > > > virtual address space with a PASID. Wherein the PASID is allocated
> > > > in software and programmed to device per device specific manner.
> > > > Devices which support PASID capability are called PASID-capable
> > > > devices. If such devices are passed through to VMs, guest software
> > > > are also able to bind guest process virtual address space on such
> > > > devices. Therefore, the guest software could reuse the bare metal
> > > > software programming model,
> > which
> > > > means guest software will also allocate PASID and program it to
> > > > device directly. This is a dangerous situation since it has
> > > > potential PASID conflicts and unauthorized address space access.
> > > > It would be safer to let host intercept in the guest software's
> > > > PASID allocation. Thus PASID are managed system-wide.
> > > >
> > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > > passdown PASID allocation/free request from the virtual IOMMU.
> > > > Additionally, such
> > >
> > > "Additionally, because such"
> > >
> > > > requests are intended to be invoked by QEMU or other applications
> > which
> > >
> > > simplify to "intended to be invoked from userspace"
> >
> > got it.
> >
> > > > are running in userspace, it is necessary to have a mechanism to
> > > > prevent single application from abusing available PASIDs in
> > > > system. With such consideration, this patch tracks the VFIO PASID
> > > > allocation per-VM. There was a discussion to make quota to be per
> > > > assigned devices. e.g. if a VM has many assigned devices, then it
> > > > should have more quota. However, it is not sure how many PASIDs an
> > > > assigned devices will use. e.g. it is
> > >
> > > devices -> device
> >
> > got it.
> >
> > > > possible that a VM with multiples assigned devices but requests
> > > > less PASIDs. Therefore per-VM quota would be better.
> > > >
> > > > This patch uses struct mm pointer as a per-VM token. We also
> > > > considered using task structure pointer and vfio_iommu structure
> > > > pointer. However, task structure is per-thread, which means it
> > > > cannot achieve per-VM PASID alloc tracking purpose. While for
> > > > vfio_iommu structure, it is visible only within vfio. Therefore,
> > > > structure mm pointer is selected. This patch adds a structure
> > > > vfio_mm. A vfio_mm is created when the first vfio container is
> > > > opened by a VM. On the reverse order, vfio_mm is free when the
> > > > last vfio container is released. Each VM is assigned with a PASID
> > > > quota, so that it is not able to request PASID beyond its quota.
> > > > This patch adds a default quota of 1000. This quota could be tuned
> > > > by administrator. Making PASID quota tunable will be added in
> > > > another
> > patch
> > > > in this series.
> > > >
> > > > Previous discussions:
> > > > https://patchwork.kernel.org/patch/11209429/
> > > >
> > > > Cc: Kevin Tian <kevin.tian@xxxxxxxxx>
> > > > CC: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > > > Cc: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > Cc: Eric Auger <eric.auger@xxxxxxxxxx>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@xxxxxxxxxx>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
> > > > Signed-off-by: Yi Sun <yi.y.sun@xxxxxxxxxxxxxxx>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  drivers/vfio/vfio.c             | 130
> > > > ++++++++++++++++++++++++++++++++++++++++
> > > >  drivers/vfio/vfio_iommu_type1.c | 104
> > > > ++++++++++++++++++++++++++++++++
> > > >  include/linux/vfio.h            |  20 +++++++
> > > >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> > > >  4 files changed, 295 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > c848262..d13b483 100644
> > > > --- a/drivers/vfio/vfio.c
> > > > +++ b/drivers/vfio/vfio.c
> > > > @@ -32,6 +32,7 @@
> > > >  #include <linux/vfio.h>
> > > >  #include <linux/wait.h>
> > > >  #include <linux/sched/signal.h>
> > > > +#include <linux/sched/mm.h>
> > > >
> > > >  #define DRIVER_VERSION	"0.3"
> > > >  #define DRIVER_AUTHOR	"Alex Williamson
> > > > <alex.williamson@xxxxxxxxxx>"
> > > > @@ -46,6 +47,8 @@ static struct vfio {
> > > >  	struct mutex			group_lock;
> > > >  	struct cdev			group_cdev;
> > > >  	dev_t				group_devt;
> > > > +	struct list_head		vfio_mm_list;
> > > > +	struct mutex			vfio_mm_lock;
> > > >  	wait_queue_head_t		release_q;
> > > >  } vfio;
> > > >
> > > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device
> > *dev,
> > > > enum vfio_notify_type type,
> > > >  EXPORT_SYMBOL(vfio_unregister_notifier);
> > > >
> > > >  /**
> > > > + * VFIO_MM objects - create, release, get, put, search
> > >
> > > why capitalizing vfio_mm?
> >
> > oops, it's not intended, will fix it.
> >
> > > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > > + */
> > > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm) {
> > > > +	struct vfio_mm *vmm;
> > > > +	struct vfio_mm_token *token;
> > > > +	int ret = 0;
> > > > +
> > > > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > > +	if (!vmm)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	/* Per mm IOASID set used for quota control and group operations
> > > > */
> > > > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > > > >ioasid_sid);
> > > > +	if (ret) {
> > > > +		kfree(vmm);
> > > > +		return ERR_PTR(ret);
> > > > +	}
> > > > +
> > > > +	kref_init(&vmm->kref);
> > > > +	token = &vmm->token;
> > > > +	token->val = mm;
> > > > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > > +	mutex_init(&vmm->pasid_lock);
> > > > +
> > > > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > > +
> > > > +	return vmm;
> > > > +}
> > > > +
> > > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm) {
> > > > +	/* destroy the ioasid set */
> > > > +	ioasid_free_set(vmm->ioasid_sid, true);
> > >
> > > do we need hold pasid lock here, since it attempts to destroy a set
> > > which might be referenced by vfio_mm_pasid_free? or is there
> > > guarantee that such race won't happen?
> >
> > Emmm, if considering the race between ioasid_free_set and
> > vfio_mm_pasid_free, I guess ioasid core should sequence the two
> > operations with its internal lock. right?
> 
> I looked at below code in free path:
> 
> +	mutex_lock(&vmm->pasid_lock);
> +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> +	if (IS_ERR(pdata)) {
> +		ret = PTR_ERR(pdata);
> +		goto out_unlock;
> +	}
> +	ioasid_free(pasid);
> +
> +out_unlock:
> +	mutex_unlock(&vmm->pasid_lock);
> 
> above implies that ioasid_find/free must be paired within the pasid_lock.
> Then if we don't hold pasid_lock above, ioasid_free_set could happen between
> find/free. I'm not sure whether this race would lead to real problem, but it doesn't
> look correct simply by looking at this file.

Well, Jacob told me to remove the ioasid_find in another email as he
believes ioasid core should be able to take care of it. and also need to
be protected by lock. If so, does it look good? :-)

 " [Jacob Pan] this might be better to put under ioasid code such that it 
  is under the ioasid lock. no one else can free the ioasid between find() and free().
  e.g. ioasid_free(sid, pasid)
  if sid == INVALID_IOASID_SET, then no set ownership checking.
  thoughts?"

> >
> > > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > > +	kfree(vmm);
> > > > +}
> > > > +
> > > > +/* called with vfio.vfio_mm_lock held */ static void
> > > > +vfio_mm_release(struct kref *kref) {
> > > > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > > > +
> > > > +	list_del(&vmm->vfio_next);
> > > > +	vfio_mm_unlock_and_free(vmm);
> > > > +}
> > > > +
> > > > +void vfio_mm_put(struct vfio_mm *vmm) {
> > > > +	kref_put_mutex(&vmm->kref, vfio_mm_release,
> > > > &vfio.vfio_mm_lock);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > > +
> > > > +/* Assume vfio_mm_lock or vfio_mm reference is held */ static
> > > > +void vfio_mm_get(struct vfio_mm *vmm) {
> > > > +	kref_get(&vmm->kref);
> > > > +}
> > > > +
> > > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) {
> > > > +	struct mm_struct *mm = get_task_mm(task);
> > > > +	struct vfio_mm *vmm;
> > > > +	unsigned long long val = (unsigned long long) mm;
> > > > +
> > > > +	mutex_lock(&vfio.vfio_mm_lock);
> > > > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > > +		if (vmm->token.val == val) {
> > > > +			vfio_mm_get(vmm);
> > > > +			goto out;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	vmm = vfio_create_mm(mm);
> > > > +	if (IS_ERR(vmm))
> > > > +		vmm = NULL;
> > > > +out:
> > > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > > +	mmput(mm);
> > >
> > > I assume this has been discussed before, but from readability p.o.v
> > > it might be good to add a comment for this function to explain how
> > > the recording of mm in vfio_mm can be correctly removed when the mm
> > > is being destroyed, since we don't hold a reference of mm here.
> >
> > yeah, I'll add it.
> >
> > > > +	return vmm;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > +
> > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max) {
> > > > +	ioasid_t pasid;
> > > > +	int ret = -ENOSPC;
> > > > +
> > > > +	mutex_lock(&vmm->pasid_lock);
> > > > +
> > > > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > > +	if (pasid == INVALID_IOASID) {
> > > > +		ret = -ENOSPC;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	ret = pasid;
> > > > +out_unlock:
> > > > +	mutex_unlock(&vmm->pasid_lock);
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > > +
> > > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid) {
> > > > +	void *pdata;
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&vmm->pasid_lock);
> > > > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > > +	if (IS_ERR(pdata)) {
> > > > +		ret = PTR_ERR(pdata);
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	ioasid_free(pasid);
> > > > +
> > > > +out_unlock:
> > > > +	mutex_unlock(&vmm->pasid_lock);
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > > +
> > > > +/**
> > > >   * Module/class support
> > > >   */
> > > >  static char *vfio_devnode(struct device *dev, umode_t *mode) @@
> > > > -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > > >  	idr_init(&vfio.group_idr);
> > > >  	mutex_init(&vfio.group_lock);
> > > >  	mutex_init(&vfio.iommu_drivers_lock);
> > > > +	mutex_init(&vfio.vfio_mm_lock);
> > > >  	INIT_LIST_HEAD(&vfio.group_list);
> > > >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > > >  	init_waitqueue_head(&vfio.release_q);
> > > >
> > > >  	ret = misc_register(&vfio_dev);
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > b/drivers/vfio/vfio_iommu_type1.c index a177bf2..331ceee 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > >  	unsigned int		dma_avail;
> > > >  	bool			v2;
> > > >  	bool			nesting;
> > > > +	struct vfio_mm		*vmm;
> > > >  };
> > > >
> > > >  struct vfio_domain {
> > > > @@ -2018,6 +2019,7 @@ static void
> > vfio_iommu_type1_detach_group(void
> > > > *iommu_data,
> > > >  static void *vfio_iommu_type1_open(unsigned long arg)  {
> > > >  	struct vfio_iommu *iommu;
> > > > +	struct vfio_mm *vmm = NULL;
> > > >
> > > >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > > >  	if (!iommu)
> > > > @@ -2043,6 +2045,10 @@ static void
> > *vfio_iommu_type1_open(unsigned
> > > > long arg)
> > > >  	iommu->dma_avail = dma_entry_limit;
> > > >  	mutex_init(&iommu->lock);
> > > >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > > +	vmm = vfio_mm_get_from_task(current);
> > > > +	if (!vmm)
> > > > +		pr_err("Failed to get vfio_mm track\n");
> > >
> > > I assume error should be returned when pr_err is used...
> >
> > got it. I didn't do it as I don’t think vfio_mm is necessary for
> > every iommu open. It is necessary for the nesting type iommu. I'll
> > make it fetch vmm when it is opening nesting type and return error
> > if failed.
> 
> sounds good.
> 
> >
> > > > +	iommu->vmm = vmm;
> > > >
> > > >  	return iommu;
> > > >  }
> > > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > > > *iommu_data)
> > > >  	}
> > > >
> > > >  	vfio_iommu_iova_free(&iommu->iova_list);
> > > > +	if (iommu->vmm)
> > > > +		vfio_mm_put(iommu->vmm);
> > > >
> > > >  	kfree(iommu);
> > > >  }
> > > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > > > vfio_iommu *iommu,
> > > >  	return ret;
> > > >  }
> > > >
> > > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > >
> > > I don't think you need prefix "vfio_iommu_type1" for every new
> > > function here, especially for leaf internal function as this one.
> >
> > got it. thanks.
> >
> > > > +{
> > > > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > > +		  flags & VFIO_IOMMU_PASID_FREE));
> > > > +}
> > > > +
> > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > +					 int min,
> > > > +					 int max)
> > > > +{
> > > > +	struct vfio_mm *vmm = iommu->vmm;
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EFAULT;
> > >
> > > why -EFAULT?
> >
> > well, it's from a prior comment as below:
> >   vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
> >   differentiate the errors. We could use EFAULT for the no IOMMU case
> >   and EINVAL here?
> > http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html
> >
> > >
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	if (vmm)
> > > > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > > +	else
> > > > +		ret = -EINVAL;
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > +				       unsigned int pasid)
> > > > +{
> > > > +	struct vfio_mm *vmm = iommu->vmm;
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EFAULT;
> > >
> > > ditto
> > >
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	if (vmm)
> > > > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > > > +	else
> > > > +		ret = -EINVAL;
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)
> > > >  {
> > > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > > *iommu_data,
> > > >
> > > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > >  			-EFAULT : 0;
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > +		struct vfio_iommu_type1_pasid_request req;
> > > > +		unsigned long offset;
> > > > +
> > > > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > > +				    flags);
> > > > +
> > > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (req.argsz < minsz ||
> > > > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > > +			return -EINVAL;
> > > > +
> > > > +		if (copy_from_user((void *)&req + minsz,
> > > > +				   (void __user *)arg + minsz,
> > > > +				   sizeof(req) - minsz))
> > > > +			return -EFAULT;
> > >
> > > why copying in two steps instead of copying them together?
> >
> > just want to do sanity check before copying all the data. I
> > can move it as one copy if it's better. :-)
> 
> it's possible fine. I just saw you did same thing for other uapis.
> 
> >
> > > > +
> > > > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > > +		{
> > > > +			int ret = 0, result;
> > > > +
> > > > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > > > +							req.alloc_pasid.min,
> > > > +							req.alloc_pasid.max);
> > > > +			if (result > 0) {
> > > > +				offset = offsetof(
> > > > +					struct
> > > > vfio_iommu_type1_pasid_request,
> > > > +					alloc_pasid.result);
> > > > +				ret = copy_to_user(
> > > > +					      (void __user *) (arg + offset),
> > > > +					      &result, sizeof(result));
> > > > +			} else {
> > > > +				pr_debug("%s: PASID alloc failed\n",
> > > > __func__);
> > > > +				ret = -EFAULT;
> > >
> > > no, this branch is not for copy_to_user error. it is about pasid alloc
> > > failure. you should handle both.
> >
> > Emmm, I just want to fail the IOCTL in such case, so the @result field
> > is meaningless in the user side. How about using another return value
> > (e.g. ENOSPC) to indicate the IOCTL failure?
> 
> If pasid_alloc fails, you return its result to userspace
> if copy_to_user fails, then return -EFAULT.
> 
> however, above you return -EFAULT for pasid_alloc failure, and
> then the number of not-copied bytes for copy_to_user.

not quite get. Let me re-paste the code. :-)

+		case VFIO_IOMMU_PASID_ALLOC:
+		{
+			int ret = 0, result;
+
+			result = vfio_iommu_type1_pasid_alloc(iommu,
+							req.alloc_pasid.min,
+							req.alloc_pasid.max);
+			if (result > 0) {
+				offset = offsetof(
+					struct vfio_iommu_type1_pasid_request,
+					alloc_pasid.result);
+				ret = copy_to_user(
+					      (void __user *) (arg + offset),
+					      &result, sizeof(result));
if copy_to_user failed, ret is the number of uncopied bytes and
will be returned to userspace to indicate failure. userspace will
not use the data in result field. perhaps, I should check the ret
here and return -EFAULT or another errno, instead of return the
number of un-copied bytes.
+			} else {
+				pr_debug("%s: PASID alloc failed\n", __func__);
+				ret = -EFAULT;
if vfio_iommu_type1_pasid_alloc() failed, no doubt, return -EFAULT
to userspace to indicate failure.
+			}
+			return ret;
+		}

is there still porblem here?
> >
> > > > +			}
> > > > +			return ret;
> > > > +		}
> > > > +		case VFIO_IOMMU_PASID_FREE:
> > > > +			return vfio_iommu_type1_pasid_free(iommu,
> > > > +							   req.free_pasid);
> > > > +		default:
> > > > +			return -EINVAL;
> > > > +		}
> > > >  	}
> > > >
> > > >  	return -ENOTTY;
> > > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > > index e42a711..75f9f7f1 100644
> > > > --- a/include/linux/vfio.h
> > > > +++ b/include/linux/vfio.h
> > > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > > > vfio_iommu_driver_ops *ops);
> > > >  extern void vfio_unregister_iommu_driver(
> > > >  				const struct vfio_iommu_driver_ops *ops);
> > > >
> > > > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > > > +struct vfio_mm_token {
> > > > +	unsigned long long val;
> > > > +};
> > > > +
> > > > +struct vfio_mm {
> > > > +	struct kref			kref;
> > > > +	struct vfio_mm_token		token;
> > > > +	int				ioasid_sid;
> > > > +	/* protect @pasid_quota field and pasid allocation/free */
> > > > +	struct mutex			pasid_lock;
> > > > +	int				pasid_quota;
> > > > +	struct list_head		vfio_next;
> > > > +};
> > > > +
> > > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct
> > *task);
> > > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > > +
> > > >  /*
> > > >   * External user API
> > > >   */
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..298ac80 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/*
> > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > + * has been extended to support DMA isolation in fine-grain.
> > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > + * and free need to be system wide. This structure defines
> > > > + * the info for pasid alloc/free between user space and kernel
> > > > + * space.
> > > > + *
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > +	__u32	argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > > +	__u32	flags;
> > > > +	union {
> > > > +		struct {
> > > > +			__u32 min;
> > > > +			__u32 max;
> > > > +			__u32 result;
> > >
> > > result->pasid?
> >
> > yes, the pasid allocated.
> >
> > >
> > > > +		} alloc_pasid;
> > > > +		__u32 free_pasid;
> > >
> > > what about putting a common pasid field after flags?
> >
> > looks good to me. But it would make the union part only meaningful
> > to alloc pasid. if so, maybe make the union part as a data field and
> > only alloc pasid will have it. For free pasid, it is not necessary
> > to read it from userspace. does it look good?
> 
> maybe keeping the union is also OK, just with {min, max} for alloc.
> who knows whether more pasid ops will be added in the future
> which may require its specific union structure. ?? putting pasid
> as a common field is reasonable because the whole cmd is for
> pasid.

got it.

Thanks,
Yi Liu

--- Begin Message ---

Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
From: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
Date: Fri, 27 Mar 2020 00:02:52 +0000
Cc: "jacob.jun.pan@xxxxxxxxxxxxxxx" <jacob.jun.pan@xxxxxxxxxxxxxxx>
In-reply-to: <1584880325-10561-2-git-send-email-yi.l.liu@intel.com>
References: <1584880325-10561-1-git-send-email-yi.l.liu@intel.com>	<1584880325-10561-2-git-send-email-yi.l.liu@intel.com>

off the list

On Sun, 22 Mar 2020 05:31:58 -0700
"Liu, Yi L" <yi.l.liu@xxxxxxxxx> wrote:

> +     mutex_lock(&vmm->pasid_lock);
> +     pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> +     if (IS_ERR(pdata)) {
> +             ret = PTR_ERR(pdata);
> +             goto out_unlock;
> +     }
> +     ioasid_free(pasid);
> +
[Jacob Pan] this might be better to put under ioasid code such that it
is under the ioasid lock. no one else can free the ioasid between find()
and free().
e.g. ioasid_free(sid, pasid)
if sid == INVALID_IOASID_SET, then no set ownership checking.

thoughts?

--- End Message ---