Re: [PATCH] vfio pci: kernel support of error recovery only for non fatal error

Cao jin <caoj.fnst@xxxxxxxxxxxxxx> · Tue, 21 Mar 2017 16:05:28 +0800

On 03/20/2017 10:30 PM, Alex Williamson wrote:
> On Mon, 20 Mar 2017 20:50:39 +0800
> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
> 
>> Sorry for the late reply.
>>
>> On 03/14/2017 06:06 AM, Alex Williamson wrote:
>>> On Mon, 27 Feb 2017 15:28:43 +0800
>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>   
>>>> 0. What happens now (PCIE AER only)
>>>>    Fatal errors cause a link reset.
>>>>    Non fatal errors don't.
>>>>    All errors stop the VM eventually, but not immediately
>>>>    because it's detected and reported asynchronously.
>>>>    Interrupts are forwarded as usual.
>>>>    Correctable errors are not reported to guest at all.
>>>>    Note: PPC EEH is different. This focuses on AER.  
>>>
>>> Perhaps you're only focusing on AER, but don't the error handlers we're
>>> using support both AER and EEH generically?  I don't think we can
>>> completely disregard how this affects EEH behavior, if at all.
>>>   
>>
>> After taking a rough look at EEH, I find that EEH always calls
>> error_detected with pci_channel_io_frozen, so from the perspective of
>> error_detected, EEH is not affected.
>>
>> One question I am not sure about: when assigning devices on an spapr host,
>> should all functions/devices in a PE be bound to vfio? I am somewhat
>> confused about the relationship between a PE and a TCE IOMMU group.
> 
> AIUI, yes all devices within the PE are part of the same IOMMU group
> and therefore all endpoints must be bound to vfio or pci-stub.
> 

Thanks. Then I think this approach won't affect EEH. I was wondering
whether the slot_reset issue you mentioned might also affect EEH, but if
all devices in the PE must be bound to vfio, it seems the issue cannot
occur with EEH.
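
To make the error_detected reasoning concrete, here is a rough user-space
sketch. It is not the patch and not kernel code. The mock_* names are
hypothetical stand-ins for the kernel's pci_channel_state values, and it
assumes the new non-fatal handling is keyed on pci_channel_io_normal, so
anything reported as frozen (as EEH always does) stays on the existing path.

```c
/*
 * Hypothetical stand-ins for the kernel's pci_channel_state values,
 * renamed so this builds in user space without kernel headers.
 */
enum mock_channel_state {
    MOCK_IO_NORMAL = 1, /* models pci_channel_io_normal: AER non-fatal */
    MOCK_IO_FROZEN = 2, /* models pci_channel_io_frozen: AER fatal, EEH */
};

/* Which handling path a given error report ends up on in this sketch. */
enum mock_path {
    MOCK_PATH_NONFATAL, /* the new non-fatal recovery handling */
    MOCK_PATH_EXISTING, /* the unchanged, pre-existing handling */
};

/*
 * Models only the branch on the channel state inside an error_detected
 * callback. Assumption: the non-fatal path is taken solely for the
 * "normal" channel state; every other state keeps the old behavior.
 */
static enum mock_path mock_error_detected(enum mock_channel_state state)
{
    if (state == MOCK_IO_NORMAL)
        return MOCK_PATH_NONFATAL;

    return MOCK_PATH_EXISTING;
}
```

Under that assumption, mock_error_detected(MOCK_IO_FROZEN) never reaches the
non-fatal branch, which is why a caller that always reports frozen would see
no change in behavior.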

-- 
Sincerely,
Cao jin