Re: [PATCH v3 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 16 Oct 2024 10:11:34 -0600

On Wed, 16 Oct 2024 11:35:05 +0800
Yi Liu <yi.l.liu@xxxxxxxxx> wrote:

> On 2024/10/16 00:22, Alex Williamson wrote:
> > On Tue, 15 Oct 2024 19:07:52 +0800
> > Yi Liu <yi.l.liu@xxxxxxxxx> wrote:
> >   
> >> On 2024/10/14 23:49, Alex Williamson wrote:  
> >>> On Sat, 12 Oct 2024 21:49:05 +0800
> >>> Yi Liu <yi.l.liu@xxxxxxxxx> wrote:
> >>>      
> >>>> On 2024/10/1 20:11, Jason Gunthorpe wrote:  
> >>>>> On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
> >>>>>         
> >>>>>>> +struct vfio_device_pasid_attach_iommufd_pt {
> >>>>>>> +	__u32	argsz;
> >>>>>>> +	__u32	flags;
> >>>>>>> +	__u32	pasid;
> >>>>>>> +	__u32	pt_id;
> >>>>>>> +};
> >>>>>>> +
> >>>>>>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
> >>>>>>> VFIO_BASE + 21)  
> >>>>>>
> >>>>>> Not sure whether this was discussed before. Does it make sense
> >>>>>> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
> >>>>>> by introducing a new pasid field and a new flag bit?  
> >>>>>
> >>>>> Maybe? I don't have a strong feeling either way.
> >>>>>
> >>>>> There is somewhat less code if you reuse the ioctl at least  
> >>>>
> >>>> I had a rough memory that I was suggested to add a separate ioctl for
> >>>> PASID. Let's see Alex's opinion.  
> >>>
> >>> I don't recall any previous arguments for separate ioctls, but it seems
> >>> to make a lot of sense to me to extend the existing ioctls with a flag
> >>> to indicate pasid cscope and id.  Thanks,  
> >>
> >> thanks for the confirmation. How about the below?
> >>
> >> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> >> index bb1817bd4ff3..c78533bce3c6 100644
> >> --- a/drivers/vfio/device_cdev.c
> >> +++ b/drivers/vfio/device_cdev.c
> >> @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df)
> >>    int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
> >>    			    struct vfio_device_attach_iommufd_pt __user *arg)
> >>    {
> >> -	struct vfio_device *device = df->device;
> >> +	unsigned long minsz = offsetofend(
> >> +			struct vfio_device_attach_iommufd_pt, pt_id);
> >>    	struct vfio_device_attach_iommufd_pt attach;
> >> -	unsigned long minsz;
> >> +	struct vfio_device *device = df->device;
> >> +	u32 user_size;
> >>    	int ret;
> >>
> >> -	minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
> >> +	ret = get_user(user_size, (u32 __user *)arg);
> >> +	if (ret)
> >> +		return ret;
> >>
> >> -	if (copy_from_user(&attach, arg, minsz))
> >> -		return -EFAULT;
> >> +	ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
> >> +	if (ret)
> >> +		return ret;  
> > 
> > I think this could break current users.   
> 
> not quite get here. My understanding is as the below:
> 
> If the current user (compiled with the existing uapi struct), it will
> provide less data that the new kernel knows. The copy_struct_from_user()
> would zero the trailing bytes. And such user won't set the new flag, so
> it should be fine.

You're making an assumption that the user is passing exactly the
existing struct size as argsz, which is not a requirement.  The user
could allocate a buffer page, argsz might be 4K.

> So to me, the trailing bytes concern comes when user is compiled with the
> new uapi struct but running on an eld kernel (say <= 6.12).But the eld
> kernel uses copy_from_user(), so even there is non-zero trailing bytes in
> the user buffer, the eld kernel just ignores them. So this seems not an
> issue to me so far.

That's new userspace, old kernel.  I'm referencing an issue with old
userspace, new kernel, where old userspace has no requirement that
argsz is exactly the structure size, nor that the trailing bytes from
the structure size to argsz are zero'd.

> > For better or worse, we don't
> > currently have any requirements for the remainder of the user buffer,
> > whereas copy_struct_from_user() returns an error for non-zero trailing
> > bytes.  
> 
> This seems to be a general requirement when using copy_struct_from_user().
> User needs to zero the fields that it does not intend to use. If there is
> no such requirement for user, then the trailing bytes can be a concern in
> the future but not this time as the existing kernel uses copy_from_user().

Exactly, the current implementation makes no requirement on trailing
non-zero bytes, copy_struct_from_user() does.

> > I think we need to monotonically increase the structure size,
> > but maybe something more like below, using flags.  The expectation
> > would be that if we add another flag that extends the structure, we'd
> > test that flag after PASID and clobber xend to a new value further into
> > the new structure.  We'd also add that flag to the flags mask, but we'd
> > share the copy code.  
> 
> agree, this share code might be needed for other path as well. Some macros
> I guess.
> 
> > 
> > 	if (attach.argsz < minsz)
> > 		return -EINVAL;
> > 
> > 	if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
> > 		return -EINVAL;
> > 
> > 	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
> > 		xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
> > 
> > 	if (xend) {
> > 		if (attach.argsz < xend)
> > 			return -EINVAL;
> > 	
> > 		if (copy_from_user((void *)&attach + minsz,
> > 				    (void __user *)arg + minsz, xend - minsz))
> > 			return -EFAULT;
> > 	}  
> 
> I think it might need to zero the trailing bytes if the user does not set
> the extended flag. is it?

We could zero initialize the attach structure for safety, but the kernel
side code should also be driven by flags.  We should never look at a
field beyond the base structure that isn't indicated by flags and copied
from the user.

> > Maybe there are still more elegant options available.
> > 
> > We also generally try to label flags with FLAGS in the name, but it
> > does get rather unwieldy, so I'm open to suggestions.  Thanks,  
> 
> There is already examples that new field added to a kernel-to-user uapi
> struct like the vfio_region_info::cap_offset and
> vfio_device_info::cap_offset. But it might be a little bit different
> with the case we face here as it's user-to-kernel struct. It's a time for
> you to set a rule for such extensions. :)

That's a returned structure, so it's a bit different, but the same
philosophy, we don't break userspace.  In that case we used argsz as an
output field and flags to indicate there are capabilities.  Old
userspace ignores the flag and argsz semantics and continues to work.
New userspace checks the flag, reallocs the buffer with argsz and
retries.  This is however an example of userspace providing an argsz
value that exceeds the ioctl structure, with no requirement to zero the
buffer, though it is an output structure rather than the input
structure here.

I think the same holds here, our policy is to not break userspace.  We
potentially do break userspace if we impose a requirement that the
trailing buffer size must now be zero.  We have the flags field so that
we don't need to blindly base the structure version on the size of the
user buffer.  We should use the flags field to determine relevant and
valid fields beyond the base structure without imposing new
requirements to userspace.  Thanks,

Alex