Re: [PATCH v6] vfio error recovery: kernel support

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 5 Apr 2017 16:56:15 -0600

On Thu, 6 Apr 2017 01:36:31 +0300
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

> On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
> > On Thu, 6 Apr 2017 00:50:22 +0300
> > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> >   
> > > On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:  
> > > > The previous intention of trying to handle all sorts of AER faults
> > > > clearly had more value, though even there the implementation and
> > > > > configuration requirements restricted the practicality.  For instance,
> > > > is AER support actually useful to a customer if it requires all ports
> > > > of a multifunction device assigned to the VM?  This seems more like a
> > > > feature targeting whole system partitioning rather than general VM
> > > > device assignment use cases.  Maybe that's ok, but it should be a clear
> > > > design decision.    
> > > 
> > > Alex, what kind of testing do you expect to be necessary?
> > > Would you say testing on real hardware and making it trigger
> > > AER errors is a requirement?  
> > 
> > Testing various fatal, non-fatal, and corrected errors with aer-inject,
> > > especially in multifunction configurations (where more than one port
> > is actually usable) would certainly be required.  If we have cases where
> > the driver for a companion function can escalate a non-fatal error to a
> > bus reset, that should be tested, even if it requires temporary hacks to
> > the host driver for the companion function to trigger that case.  AER
> > handling is not something that the typical user is going to experience,
> > > so it should be thoroughly tested to make sure it works when needed
> > or there's little point to doing it at all.  Thanks,
> > 
> > Alex  
> 
> Some things can be tested within a VM. What would you
> say would be sufficient on a VM and what has to be
> tested on bare metal?

Testing on a VM could be interesting for development, but I'd expect
bare metal for validation, no offense.  Bus reset timing can be
different, error propagation can be different, etc.  Thanks,

Alex