Re: [RFC v1 2/2] vfio-ccw: Don't exit early if state of the vfio-ccw subchannel is not idle

Cornelia Huck <cohuck@xxxxxxxxxx> · Tue, 22 Jan 2019 11:20:41 +0100

On Mon, 21 Jan 2019 21:10:32 +0100
Halil Pasic <pasic@xxxxxxxxxxxxx> wrote:

> On Mon, 21 Jan 2019 17:55:39 +0100
> Cornelia Huck <cohuck@xxxxxxxxxx> wrote:
> 
> > On Mon, 21 Jan 2019 11:14:16 -0500
> > Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:
> >   
> > > On 01/21/2019 10:52 AM, Cornelia Huck wrote:  
> > > > On Mon, 21 Jan 2019 10:48:32 -0500
> > > > Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:
> > > >     
> > > >> On 01/21/2019 10:18 AM, Cornelia Huck wrote:    
> > > >>> On Mon, 21 Jan 2019 09:54:09 -0500
> > > >>> Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:
> > > >>>        
> > > >>>> This check is unecessary as we already have the vfio state machine
> > > >>>> to handle I/O requests.
> > > >>>>
> > > >>>> On the other hand, this check returns incorrect information to
> > > >>>> userspace if the state of the subchannel is not idle. For example
> > > >>>> if the state is busy and new I/O request comes in, this will return
> > > >>>> an EACCES, whereas we should return EBUSY.
> > > >>>>
> > > >>>> Signed-off-by: Farhan Ali <alifm@xxxxxxxxxxxxx>
> > > >>>> ---
> > > >>>>    drivers/s390/cio/vfio_ccw_ops.c | 2 --
> > > >>>>    1 file changed, 2 deletions(-)
> > > >>>>
> > > >>>> diff --git a/drivers/s390/cio/vfio_ccw_ops.c b/drivers/s390/cio/vfio_ccw_ops.c
> > > >>>> index f673e10..3fdcc6d 100644
> > > >>>> --- a/drivers/s390/cio/vfio_ccw_ops.c
> > > >>>> +++ b/drivers/s390/cio/vfio_ccw_ops.c
> > > >>>> @@ -193,8 +193,6 @@ static ssize_t vfio_ccw_mdev_write(struct mdev_device *mdev,
> > > >>>>    		return -EINVAL;
> > > >>>>    
> > > >>>>    	private = dev_get_drvdata(mdev_parent_dev(mdev));
> > > >>>> -	if (private->state != VFIO_CCW_STATE_IDLE)
> > > >>>> -		return -EACCES;
> > > >>>>    
> > > >>>>    	region = private->io_region;
> > > >>>>    	if (copy_from_user((void *)region + *ppos, buf,
> > > >>>> count))    
> > > >>>
> > > >>> Hm, the patchset for halt/clear handling I recently posted
> > > >>> changes this to a check for NOT_OPER || STANDBY. What do you
> > > >>> think of that option?
> > > >>>
> > > >>>        
> > > >> I am concerned with the return code that we send userspace. With
> > > >> the state machines we return an EIO for NOT_OPER or STANDBY, but
> > > >> we return EACCES in the early check. QEMU on an EACCES returns a
> > > >> 'not_oper' to the guest and for EIO will inject an interrupt.    
> > > > 
> > > > I actually think that not_oper is the saner option here: the
> > > > device is in a state on the vfio side where it is not usable...    
> > > 
> > > Agreed. Then we should the change the return codes with the state 
> > > machine to EACCES as well, that could be a follow up patch to your
> > > series.  
> > 
> > Yes, let's do that.
> > 
> > As you seem to be able to hit this problem with your workload, maybe
> > post it as a standalone fix? I can easily rebase my series on top of
> > that (and I assume that it will need more review than a simple fix.)
> >   
> 
> I had an issue following the discussion so I chatted up Farhan. Turns
> out the bad private->state he sees is VFIO_CCW_STATE_BUSY. That is, I
> guess, a should normally not happen. I mean QEMU should not accept a new
> SSCH before the START_PENDING for the old one is cleared. And that would
> normally happen after the IO corresponding to the old SSCH is done. Well
> we do have our wonderful emulated CSCH/SSCH at the moment (which do clear
> the SCSW in QEMU but don't bother to terminate IO at the device). And we
> may or may not have bugs in all the SCSW handling/updating code as well.

Well, that hopefully becomes better once we have proper halt/clear
passthrough. Any kind of emulation will continue to be problematic in
one way or the other.

> The problem with this patch is that we poke private->io_region before
> calling the state machine stuff. That could race with the interrupt
> handler touching the same region. That could have been OK, if IRB in the
> region was read-only. But it is not, and QEMU zeroes it out on each
> write AFAICT. Changing the condition to NOT_OPER || STANDBY or
> omitting the check (at the moment) does not seem to be what I would call
> a fix.

So, should this wait until after we rewrite the state machine and I/O
region concurrency?

I'm not sure what the test setup exercises, but making the return code
(-EACCES vs. -EIO) consistent looks like a win to me.