Re: Hard and silent lock up since linux 3.14 with PCIe pass through (vfio)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 2014-10-29 at 17:47 +0100, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > On Sat, 2014-10-25 at 08:03 +0200, Andreas Hartmann wrote:
> >>
> >> Out of interest:
> >> Bjorn's patch disables vc save/restore support - and the machine works
> >> fine again. Why is it needed at all if it seems to work perfectly w/o
> >> it? What's the additional benefit? Or in other words: What am I missing
> >> until today :-) ? What would be better? What could I do more?
> > 
> > 
> > You're right, in the configuration you have the endpoint device has a
> > Virtual Channel capability but the upstream root port does not.  The
> > spec is not at all clear about defining the endpoints for enabling
> > Virtual Channel in each type of configuration, but I think that if we
> > have an upstream port that does not support Virtual Channel, we can skip
> > the save/restore.  Please test the patch below.
> > 
> > I'm also still completely confused about whether this is a VC
> > save/restore issue or a bus reset issue.  You originally bisected this
> > back to the VC save/restore patch, but you also found that a manual,
> > setpci-based bus reset triggered a system hang.
> 
> With your additional patch posted here:
> http://article.gmane.org/gmane.linux.kernel.pci/36162

Right, a reset via sysfs also triggered it with that patch, but the
reset via setpci is independent of any VC save/restore and still hung
your box.

> 
> >  I believe that
> > re-ordering the kernel reset mechanisms also triggered this.  Since
> > recent versions of QEMU are going to favor a bus reset over PM reset, I
> > don't have a lot of confidence that we're actually solving the problem
> > for you.  Please make sure to test with a recent QEMU to be sure we'll
> > do a bus reset.
> 
> I'm running qemu 2.1.0 (newest is 2.1.2 - but this shouldn't be a
> problem) and tested w/ linux 3.17.

Yep, just want to make sure it's QEMU new enough to do a bus reset and
kernel with matching support.

> > diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
> > index 7e1304d..6d13d34 100644
> > --- a/drivers/pci/vc.c
> > +++ b/drivers/pci/vc.c
> > @@ -339,6 +339,25 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
> >  	return buf ? 0 : len;
> >  }
> >  
> > +/**
> > + * pci_vc_needs_save - Determine whether a VC capability needs to be saved
> > + * @dev: device
> > + * @id: VC capability ID (VC/VC9/MFVC)
> > + *
> > + * In configurations where we have a VC or MFVC capability, but the upstream
> > + * device does not, we assume that VC save (and therefore restore) is not
> > + * necessary.  The intention is to only do VC save/restore in configuration
> > + * where it's necessary and hopefully avoid reset issues.
> > + */
> > +static bool pci_vc_needs_save(struct pci_dev *dev, u16 id)
> > +{
> > +	if (id == PCI_EXT_CAP_ID_VC9 || pci_is_root_bus(dev->bus) ||
> > +	    pci_find_ext_capability(dev->bus->self, PCI_EXT_CAP_ID_VC))
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >  static struct {
> >  	u16 id;
> >  	const char *name;
> > @@ -362,7 +381,7 @@ int pci_save_vc_state(struct pci_dev *dev)
> >  		struct pci_cap_saved_state *save_state;
> >  
> >  		pos = pci_find_ext_capability(dev, vc_caps[i].id);
> > -		if (!pos)
> > +		if (!posi || !pci_vc_needs_save(dev, vc_caps[i].id))
>                         ^
> This should be most probably !pos (and not !posi - because !posi does
> through a compile error).

Oops, sorry.

> >  			continue;
> >  
> >  		save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
> > @@ -422,7 +441,7 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
> >  	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
> >  		int len, pos = pci_find_ext_capability(dev, vc_caps[i].id);
> >  
> > -		if (!pos)
> > +		if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
> >  			continue;
> >  
> >  		len = pci_vc_do_save_buffer(dev, pos, NULL, false);
> 
> W/ the above patch, the machine hangs again (w/ qemu and setpci), but w/
> Bjorn's patch (and nothing more applied) which disables vc save/restore,
> the machine just works fine ... . I especially retested this case to be
> really sure. I'm so sorry. But that's how it behaves here :-(

Hmm, the intention was that this should effectively do the same thing as
Bjorn's patch.  The Atheros device (03:00.0) reports a VC capability but
the root port above it (00:05.0) does not.  The test in
pci_vc_needs_save() should therefore be false for all tests within the
if() block and the function should return false, causing us to neither
allocate a save buffer or perform a save.  The restore will
automatically be skipped since there's no save buffer.

This is what I was afraid of, on one hand you've bisected and proved via
patching that the problem is exclusively due to VC save/restore, but we
also have testing that indicates that we can't do a bus reset at all on
this port.  So I re-iterate my confusion that we don't seem to have a
good idea which is the problem.  Do you have any other devices that you
can install in the slot for testing?  Maybe if we knew that bus reset is
only a problem with the Atheros card then we could blacklist it.
Thanks,

Alex


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux