Re: [PATCH v9 2/3] Documentation: add a isolation strategy sysfs node for uacce

On Tue, Oct 25, 2022 at 12:39:30PM +0000, Kai Ye wrote:
> Update documentation describing sysfs node that could help to
> configure isolation strategy for users in the user space. And
> describing sysfs node that could read the device isolated state.
> 
> Signed-off-by: Kai Ye <yekai13@xxxxxxxxxx>
> ---
>  Documentation/ABI/testing/sysfs-driver-uacce | 27 ++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> index 08f2591138af..50737c897ba3 100644
> --- a/Documentation/ABI/testing/sysfs-driver-uacce
> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> @@ -19,6 +19,33 @@ Contact:        linux-accelerators@xxxxxxxxxxxxxxxx
>  Description:    Available instances left of the device
>                  Return -ENODEV if uacce_ops get_available_instances is not provided
>  
> +What:           /sys/class/uacce/<dev_name>/isolate_strategy
> +Date:           Oct 2022
> +KernelVersion:  6.1
> +Contact:        linux-accelerators@xxxxxxxxxxxxxxxx
> +Description:    (RW) Configure the frequency size for the hardware error
> +                isolation strategy. This unit is the number of times. Number

Number of times what?

> +                of occurrences in a period, also means threshold. If the number
> +                of device pci AER error exceeds the threshold in a time window,

What is the time window?

> +                the device is isolated. This size is a configured integer value.
> +                The default is 0. The maximum value is 65535.
> +
> +                In the hisilicon accelerator engine, first we will
> +                time-stamp every slot AER error. Then check the AER error log
> +                when the device AER error occurred. if the device slot AER error
> +                count exceeds the preset the number of times in one hour, the
> +                isolated state will be set to true. So the device will be
> +                isolated. And the AER error log that exceed one hour will be
> +                cleared.

This seems like a very hardware-specific implementation here.  And this
is supposed to be a generic class?

I feel this is getting really messy :(
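
For reference, the policy in the quoted description reads like a
sliding-window error counter: time-stamp each error, drop entries older
than one hour, and isolate once the remaining count exceeds the
configured threshold. A minimal user-space sketch of that reading
follows. It is not the hisilicon driver code. The names err_log,
err_log_record, WINDOW_SECS and MAX_LOG are hypothetical, the fixed log
size is a simplification, and treating a threshold of 0 as "isolation
disabled" is an assumption the quoted text does not confirm.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SECS 3600u /* one hour, as stated in the quoted description */
#define MAX_LOG     64u   /* hypothetical log capacity, for this sketch only */

struct err_log {
	uint64_t stamps[MAX_LOG]; /* timestamps (seconds) of recent errors */
	size_t count;             /* number of valid entries in stamps[] */
	uint32_t threshold;       /* value written to isolate_strategy */
	bool isolated;            /* sticky once set */
};

/*
 * Record one error at monotonic time `now` (seconds) and return the
 * isolation state. Because the log holds at most MAX_LOG entries, this
 * sketch cannot trip thresholds of MAX_LOG or more.
 */
static bool err_log_record(struct err_log *log, uint64_t now)
{
	size_t kept = 0;

	/* Keep only entries that are at most one hour old. */
	for (size_t i = 0; i < log->count; i++) {
		if (now - log->stamps[i] <= WINDOW_SECS)
			log->stamps[kept++] = log->stamps[i];
	}

	/* If the log is full, discard the oldest entry to make room. */
	if (kept == MAX_LOG) {
		memmove(&log->stamps[0], &log->stamps[1],
			(MAX_LOG - 1) * sizeof(log->stamps[0]));
		kept--;
	}

	log->stamps[kept++] = now;
	log->count = kept;

	/* Assumption: threshold 0 means the check is disabled. */
	if (log->threshold != 0 && log->count > log->threshold)
		log->isolated = true;

	return log->isolated;
}
```

With a threshold of 2, a third error inside the same hour isolates the
device, while errors spaced more than an hour apart never accumulate.
Nothing in the sketch is specific to one vendor except the one-hour
window, which is the part the quoted text ties to hisilicon.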

thanks,

greg k-h


