Re: [Qemu-devel] [PATCH 0/3] vfio-ccw: support hsch/csch (kernel part)

Cornelia Huck <cohuck@xxxxxxxxxx> · Fri, 21 Dec 2018 12:23:32 +0100

On Wed, 19 Dec 2018 15:17:19 +0100
Halil Pasic <pasic@xxxxxxxxxxxxx> wrote:

> On Wed, 19 Dec 2018 12:54:42 +0100
> Cornelia Huck <cohuck@xxxxxxxxxx> wrote:
> 
> > On Fri, 7 Dec 2018 17:54:23 +0100
> > Halil Pasic <pasic@xxxxxxxxxxxxx> wrote:
> >   
> > > On Fri, 7 Dec 2018 11:05:29 +0100
> > > Cornelia Huck <cohuck@xxxxxxxxxx> wrote:
> > >   
> > > > > > I think most of the sorting-out-the-operations stuff should be done by
> > > > > > the hardware itself, and we should not really try to enforce anything
> > > > > > special in our vfio code.
> > > > > >         
> > > > > 
> > > > > Sounds very reasonable to me. Does this mean you are against Pierre's
> > > > > '[PATCH v3 6/6] vfio: ccw: serialize the write system calls' as it does
> > > > > not let HW sort out stuff, but enforces sequencing?      
> > > > 
> > > > I have not yet had time to look at that, sorry.
> > > >     
> > > > > 
> > > > >       
> > > > > > For your example, it might be best if a hsch is always accepted and
> > > > > > send on towards the hardware.        
> > > > > 
> > > > > Nod.
> > > > >       
> > > > > > Probably best to reflect back -EAGAIN if
> > > > > > we're currently processing another instruction from another vcpu, so
> > > > > > that the user space caller can retry.        
> > > > > 
> > > > > Hm, not sure how this works together with your previous sentence.      
> > > > 
> > > > The software layering. We have the kernel layer
> > > > (drivers/s390/cio/vfio_ccw_*) that interacts with the hardware more or
> > > > less directly, and the QEMU layer, which does some writes on regions.
> > > > In the end, the goal is to act on behalf of the guest issuing a
> > > > ssch/hsch/csch, which is from the guest's view a single instruction. We
> > > > should not have the individual "instructions" compete with each other
> > > > so that they run essentially in parallel (kernel layer), but we should
> > > > also not try to impose an artificial ordering as to when instructions
> > > > executed by different vcpus are executed (QEMU layer). Therefore, don't
> > > > try to run an instruction in the kernel when another one is in progress
> > > > for the same subchannel (exclusivity in the kernel), but retry in QEMU
> > > > if needed (no ordering between vcpus imposed).
> > > > 
> > > > In short, don't create strange concurrency issues in the "instruction"
> > > > handling, but make it possible to execute instructions in a
> > > > non-predictable order if the guest does not care about enforcing
> > > > ordering on its side.
> > > >     
> > > 
> > > I'm neither sold on this, nor am I violently opposing it. Will try to
> > > meditate on it some more if any spare cycles arise. Currently I don't
> > > see the benefit of the non-predictable order over plain FCFS. For
> > > example, let's assume we have a ssch "instruction" that 1 second to
> > > complete. Since normally ssch instruction  does not have to process the
> > > channel program, and is thus kind of a constant time operation (now we
> > > do the translation and the pinning as a part of the "instruction), our
> > > strange guest gets jumpy and does a csch after T+0.2s. And on T+0.8 in
> > > desperation follows the whole up with a hsch. If I understand your
> > > proposal correctly, both userspace handlers would spin on -EAGAIN until
> > > T+1. When ssch is done the csch and the hsch would race for who can
> > > be the next. I don't quite get the value of that.  
> > 
> > What would happen on real hardware for such a guest? I would expect
> > that the csch and the hsch would be executed in a random order as well.
> >   
> 
> Yes, they would be executed in random order, but would not wait until the
> ssch is done (and especially not wait until the channel program gets
> translated). AFAIR bot cancel the start function immediately -- if any
> pending.
> 
> Furthermore the point where the race is decided is changing the function
> control bits -- the update needs to be an interlocked one obviously.
> 
> What I want to say, there is no merit in waiting -- one second in the
> example. At some point it needs to be decided who is considered first,
> and artificially procrastinating this decision does not do us any good,
> because we may end up with otherwise unlikely behavior.

You've really lost me here :( I fear you're criticizing something I
don't want to implement; I'll write some code, that should make things
much easier to discuss.

> 
> > My point is that it is up to the guest to impose an order on the
> > execution of instructions, if wanted. We should not try to guess
> > anything; I think that would make the implementation needlessly complex.
> >   
> 
> I'm not for guessing stuff, but rather for sticking to the architecture.
> 
> > >   
> > > > >       
> > > > > > Same for ssch, if another ssch is
> > > > > > already being processed. We *could* reflect cc 2 if the fctl
> > > > > > bit is already set, but that won't do for csch, so it is
> > > > > > probably best to have the hardware figure that out in any case.
> > > > > >         
> > > > > 
> > > > > We just need to be careful about avoiding races if we let hw sort
> > > > > out things. If an ssch is issued with the start function pending
> > > > > the correct response is cc 2.       
> > > > 
> > > > But sending it on to the hardware will give us that cc 2, no?
> > > >     
> > > > >       
> > > > > > If I read the code correctly, we currently reflect -EBUSY and
> > > > > > not -EAGAIN if we get a ssch request while already processing
> > > > > > another one. QEMU hands that back to the guest as a cc 2, which
> > > > > > is not 100% correct. In practice, we don't see this with Linux
> > > > > > guests due to locking.   
> > > > > 
> > > > > Nod, does not happen because of BQL. We currently do the
> > > > > user-space counterpart of vfio_ccw_mdev_write() in BQL context or
> > > > > (i.e. we hold BQL until translation is done and our host ssch()
> > > > > comes back)?      
> > > > 
> > > > The Linux kernel uses the subchannel lock to enforce exclusivity for
> > > > subchannel instructions, so we won't see Linux guests issue
> > > > instructions on different vcpus in parallel, that's what I meant.
> > > >    
> > > 
> > > That is cool. Yet I think the situation with the BQL is relevant.
> > > Because while BQL is held, not only IO instructions on a single
> > > vfio-ccw device are mutually exclusive. AFAIU no other instruction
> > > QEMU instruction handler can engage. And a store subchannel for
> > > device A having to wait until the translation for the start
> > > subchannel on device B is done is not the most scary thing I can
> > > imagine.   
> > 
> > Yes. But we still need to be able to cope with a userspace that does
> > not give us those guarantees.
> >   
> 
> I agree. The point I was trying to make is not that 'We are good, because
> qemu takes care of it!' on the contrary, I wanted to give voice to my
> concern that a guest that has a couple of vfio-ccw devices in use could
> experience performance problems because vfio-ccw holds BQL for long.

TBH, I have no idea how this will scale to many vfio-ccw devices.

> 
> > > > > 
> > > > > I think -EBUSY is the correct response for ssch while start
> > > > > pending set. I think we set start pending in QEMU before we issue
> > > > > 'start command/io request' to the kernel. I don't think -EAGAIN
> > > > > is a good idea. AFAIU we would expect user-space to loop on
> > > > > -EAGAIN e.g. at least until the processing of a 'start command'
> > > > > is done and the (fist) ssch by the host is issued. And then
> > > > > what?  Translate the second channel program issue the second ssch
> > > > > in the host and probably get a non-zero cc? Or return -EBUSY? Or
> > > > > keep returning -EAGAIN?      
> > > > 
> > > > My idea was:
> > > > - return -EAGAIN if we're already processing a channel instruction
> > > > - continue returning -EBUSY etc. if the instruction gets the
> > > > respective return code from the hardware
> > > > 
> > > > So, the second ssch would first get a -EAGAIN and then a -EBUSY if
> > > > the first ssch is done, but the subchannel is still doing the start
> > > > function. Just as you would expect when you do a ssch while your
> > > > last request has not finished yet.
> > > >     
> > > 
> > > But before you can issue the second ssch you have to do the
> > > translation for it. And we must assume the IO corresponding to the
> > > first ssch is not done yet -- so we still need the translated channel
> > > program of the first ssch.   
> > 
> > Yes, we need to be able to juggle different translated channel programs
> > if we don't consider this part of the "instruction execution". But if
> > we return -EAGAIN if the code is currently doing that translation, we
> > should be fine, no?
> >   
> 
> As long as you return -EAGAIN we are fine. But AFAIU you proposed to
> do that until the I/O is submitted to the HW subchannel via ssch(). But
> that is not the case I'm talking about here. We have already translated
> the channel program for the first request, submitted it via ssch() and
> are awaiting an interrupt that tells us the I/O is done. While waiting
> for this interrupt we get a new ssch request. I understood, you don't
> want to give -EAGAIN for this one, but make the ssch decide. The problem
> is you still need the old translated channel program for the interrupt
> handling, and at the same time you need the new channel program
> translated as well, before doing the ssch for it in the host.

Why? You're not doing anything with that second ssch at all, it returns
before translation is started.

> 
> > > That is if we insist on doing the -EBUSY
> > > based on a return code from the hardware. I'm not sure we end up with
> > > a big simplification from making the "instructions" mutex on vfio-ccw
> > > device level in kernel as proposed above.   
> > 
> > I'm not sure we're not talking past each other here...   
> 
> I'm afraid we do.
> 
> > the "translate
> > and issue instruction" part should be mutually exclusive; I just don't
> > want to return -EBUSY, but -EAGAIN, so that userspace knows it should
> > try again.
> >   
> 
> I got it. But I wanted to point out, that we need the old channel program
> *beyond* the "translate and issue instruction".

Of course. But I don't want to start a new channel program.

> 
> > > But I'm not against it. If
> > > you have the time to write the patches I will find time to review
> > > them.  
> > 
> > Probably only on the new year...  
> 
> I think the stuff is better discussed with code at hand. I'm happy to
> continue this discussion if you think it is useful to you. Otherwise I
> suggest do it the way you think is the best, and I will try  to find and
> to point out the problems, if any.

I'll try to post something in January. Have a nice holiday break :)