Re: [RFC v1 1/1] vfio-ccw: Don't call cp_free if we are processing a channel program

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 24 Jun 2019 11:42:31 +0200
Cornelia Huck <cohuck@xxxxxxxxxx> wrote:

> On Fri, 21 Jun 2019 14:34:10 -0400
> Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:
> 
> > On 06/21/2019 01:40 PM, Eric Farman wrote:
> > > 
> > > 
> > > On 6/21/19 10:17 AM, Farhan Ali wrote:  
> > >>
> > >>
> > >> On 06/20/2019 04:27 PM, Eric Farman wrote:  
> > >>>
> > >>>
> > >>> On 6/20/19 3:40 PM, Farhan Ali wrote:  
> > >>>> There is a small window where it's possible that an interrupt can
> > >>>> arrive and can call cp_free, while we are still processing a channel
> > >>>> program (i.e allocating memory, pinnging pages, translating  
> > >>>
> > >>> s/pinnging/pinning/
> > >>>  
> > >>>> addresses etc). This can lead to allocating and freeing at the same
> > >>>> time and can cause memory corruption.
> > >>>>
> > >>>> Let's not call cp_free if we are currently processing a channel program.  
> > >>>
> > >>> The check around this cp_free() call is for a solicited interrupt, so
> > >>> it's presumably in response to a SSCH we issued.  But if we're still
> > >>> processing a CP, then we hadn't issued the SSCH to the hardware yet.  So
> > >>> what is this interrupt for?  Do the contents of irb.cpa provide any
> > >>> clues, perhaps if it's in the current cp or for someone else?
> > >>>  
> > >>
> > >> I don't think the interrupt is in response to an ssch but rather due to
> > >> an csch/hsch.
> 
> The solicited check only checks if it is solicited. It can be for any
> channel I/O instruction that causes an interrupt... we probably should
> adapt the check.
> 
> > >>  
> > >>>>
> > >>>> Signed-off-by: Farhan Ali <alifm@xxxxxxxxxxxxx>
> > >>>> ---
> > >>>>
> > >>>> I have been running my test overnight with this patch and I haven't
> > >>>> seen the stack traces that I mentioned about earlier. I would like
> > >>>> to get some reviews on this and also if this is the right thing to
> > >>>> do?
> > >>>>
> > >>>> Thanks
> > >>>> Farhan
> > >>>>
> > >>>>    drivers/s390/cio/vfio_ccw_drv.c | 2 +-
> > >>>>    1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>>
> > >>>> diff --git a/drivers/s390/cio/vfio_ccw_drv.c
> > >>>> b/drivers/s390/cio/vfio_ccw_drv.c
> > >>>> index 66a66ac..61ece3f 100644
> > >>>> --- a/drivers/s390/cio/vfio_ccw_drv.c
> > >>>> +++ b/drivers/s390/cio/vfio_ccw_drv.c
> > >>>> @@ -88,7 +88,7 @@ static void vfio_ccw_sch_io_todo(struct work_struct
> > >>>> *work)
> > >>>>                 (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
> > >>>>        if (scsw_is_solicited(&irb->scsw)) {
> > >>>>            cp_update_scsw(&private->cp, &irb->scsw);  
> > >>>
> > >>> As I alluded earlier, do we know this irb is for this cp?  If no, what
> > >>> does this function end up putting in the scsw?
> 
> Yes, I think this also needs to check whether we have at least a prior
> start function around. (We use the orb provided by the guest; maybe we
> should check if that intparm is set in the irb?)

Hrm; not so easy as we always set the intparm to the address of the
subchannel structure... 

Maybe check if we have have one of the conditions of the large table
16-6 and correlate to the ccw address? Or is it enough to check the
function control? (Don't remember when the hardware resets it.)

> 
> > >>>  
> > >>>> -        if (is_final)
> > >>>> +        if (is_final && private->state != VFIO_CCW_STATE_CP_PROCESSING)  
> > >>>
> > >>> In looking at how we set this state, and how we exit it, I see we do:
> > >>>
> > >>> if SSCH got CC0, CP_PROCESSING -> CP_PENDING
> > >>> if SSCH got !CC0, CP_PROCESSING -> IDLE
> > >>>
> > >>> While the first scenario happens immediately after the SSCH instruction,
> > >>> I guess it could be just tiny enough, like the io_trigger FSM patch I
> > >>> sent a few weeks ago.
> > >>>
> > >>> Meanwhile, the latter happens way after we return from the jump table.
> > >>> So that scenario leaves considerable time for such an interrupt to
> > >>> occur, though I don't understand why it would if we got a CC(1-3) on the
> > >>> SSCH.
> > >>>
> > >>> And anyway, the return from fsm_io_helper() in that case will also call
> > >>> cp_free().  So why does the cp->initialized check provide protection
> > >>> from a double-free in that direction, but not here?  I'm confused.  
> > >>
> > >> I have a theory where I think it's possible to have 2 different threads
> > >> executing cp_free
> > >>
> > >> If we start with private->state == IDLE and the guest issues a
> > >> clear/halt and then an ssch
> > >>
> > >> - clear/halt will be issued to hardware, and if succeeds we will return
> > >> cc=0 to guest
> > >>
> > >> - the guest can then issue ssch  
> > > 
> > > It can issue whatever it wants, but shouldn't the SSCH get a CC2 until
> > > the halt/clear pending state is cleared?  Hrm, fsm_io_request() doesn't
> > > look.  Rather, it (fsm_io_helper()) relies on the CC2 to be signalled
> > > from the SSCH issued to the device.  There's a lot of stuff that happens
> > > before we get to that point.  
> > 
> > Yes, and stuff that happens is memory allocation, pinning and other 
> > things which can take time.
> > 
> > > 
> > > I'm wondering if there's a way we could/should return the SSCH here
> > > before we do any processing.  After all, until the interrupt on the
> > > workqueue is processed, we are busy.
> > >   
> > 
> > you mean return the csch/hsch before processing the ssch? Maybe we could 
> > extend CP_PENDING state, to just PENDING and use that to reject any ssch 
> > if we have a pending csch/hsch?
> 
> I'd probably not rely on the state for this. Maybe we could look at the
> work queue? But it might be that the check for the intparm I suggested
> above is already enough.
> 
> > 
> > >>
> > >> - we get an interrupt for csch/hsch and we queue the interrupt in the
> > >> workqueue
> > >>
> > >> - we start processing the ssch and then at the same time another cpu
> > >> could be working on the  
> > >> interrupt>  
> > >>
> > >> Thread 1                                        Thread 2
> > >> --------                                        --------
> > >>
> > >> fsm_io_request                                  vfio_ccw_sch_io_todo
> > >>      cp_init                                         cp_free
> > >>      cp_prefetch
> > >>      fsm_io_helper
> > >>          cp_free
> > >>
> > >>
> > >>
> > >> The test that I am trying is with a guest running an fio workload, while
> > >> at the same time stressing the error recovery path in the guest. So
> > >> there is a lot of ssch and lot of csch.
> > >>
> > >> Of course I don't think my patch completely solves the problem, I think
> > >> it just makes the window narrower. I just wanted to get a discussion
> > >> started :)
> > >>
> > >>
> > >> Now that I am thinking more about it, I think we might have to protect
> > >> cp with it's own mutex.  
> > > 
> > > That seems like a big hammer, and I wonder if the existing SCHIB/FSM/CP
> > > state data doesn't provide that information to us.  But I gotta wander
> > > around some code before I say.  
> > 
> > Any ideas are welcome :)
> 
> See above :) That certainly looks like a much smaller hammer.
> 
> > 
> > >   
> > >>
> > >> Thanks
> > >> Farhan
> > >>
> > >>  
> > >>>  
> > >>>>                cp_free(&private->cp);
> > >>>>        }
> > >>>>        mutex_lock(&private->io_mutex);
> > >>>>  
> > >>>  
> > >>  
> > >   
> 




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux