On 2022/9/19 17:34, Greg KH wrote:
> On Mon, Sep 19, 2022 at 11:21:30AM +0800, yekai (A) wrote:
>>
>> On 2022/9/9 16:27, Greg KH wrote:
>>> On Fri, Sep 02, 2022 at 03:13:03AM +0000, Kai Ye wrote:
>>>> Update documentation describing sysfs node that could help to
>>>> configure isolation strategy for users in the user space. And
>>>> describing sysfs node that could read the device isolated state.
>>>>
>>>> Signed-off-by: Kai Ye <yekai13@xxxxxxxxxx>
>>>> ---
>>>> Documentation/ABI/testing/sysfs-driver-uacce | 26 ++++++++++++++++++++
>>>> 1 file changed, 26 insertions(+)
>>>>
>>>> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
>>>> index 08f2591138af..af5bc2f326d2 100644
>>>> --- a/Documentation/ABI/testing/sysfs-driver-uacce
>>>> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
>>>> @@ -19,6 +19,32 @@ Contact: linux-accelerators@xxxxxxxxxxxxxxxx
>>>> Description: Available instances left of the device
>>>> Return -ENODEV if uacce_ops get_available_instances is not provided
>>>>
>>>> +What: /sys/class/uacce/<dev_name>/isolate_strategy
>>>> +Date: Sep 2022
>>>> +KernelVersion: 6.0
>>>> +Contact: linux-accelerators@xxxxxxxxxxxxxxxx
>>>> +Description: (RW) Configure the frequency size for the hardware error
>>>> + isolation strategy. This size is a configured integer value.
>>>> + The default is 0. The maximum value is 65535. This value is a
>>>> + threshold based on your driver strategies.
>>> I do not understand what the units are here.
>>>
>>> How is anyone supposed to know what they are?
>> This unit is the number of times. Number of occurrences in a period, also means threshold.
>> If the number of device pci AER error exceeds the threshold in a time window, the device is
>> isolated.
> Please document this very very well.
>
>>>> + For example, in the hisilicon accelerator engine, first we will
>>>> + time-stamp every slot AER error. Then check the AER error log
>>>> + when the device AER error occurred. if the device slot AER error
>>>> + count exceeds the preset the number of times in one hour, the
>>>> + isolated state will be set to true. So the device will be
>>>> + isolated. And the AER error log that exceed one hour will be
>>>> + cleared. Of course, different strategies can be defined in
>>>> + different drivers.
>>> So this file can contain values of different units depending on the
>>> different driver that creates it? How is anyone supposed to know what
>>> it is and what it should be?
>>>
>>> This feels very loose, please define this much better so that it can be
>>> understood and maintained properly.
>>>
>>> thanks,
>>>
>>> greg k-h
>>> .
>>>
>> Yes, We started out with the idea of not restricting the different drive, only specifying the input and output.
>> Because we think different drivers require different processing strategy.
> What different drivers? You only have one! And why do you need a
> framework for only one driver? You should only add that when you have
> multiple users to ensure you got the framework correct otherwise you do
> not know how it will be used.
>
> thanks,
>
> greg k-h
> .
>
OK. I will move the isolation strategy from qm to uacce in the next version, so that
it can be understood and maintained properly.

thanks
Kai
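
For reference, the one-hour strategy described in the quoted patch text (time-stamp each AER error, drop entries older than one hour, isolate once the count exceeds the threshold) could be modelled roughly as below. This is only a userspace sketch in Python, not the driver code: the class name `IsolationTracker`, its methods, and the use of plain seconds as timestamps are all hypothetical. The thread does not say how a threshold of 0 should behave, so the sketch simply applies the "exceeds" comparison literally.

```python
from collections import deque

WINDOW_SECONDS = 3600   # one hour, as in the hisilicon example in the patch
MAX_THRESHOLD = 65535   # maximum value stated in the proposed ABI text


class IsolationTracker:
    """Hypothetical model of a count-in-time-window isolation strategy."""

    def __init__(self, threshold=0, window=WINDOW_SECONDS):
        if not 0 <= threshold <= MAX_THRESHOLD:
            raise ValueError("threshold must be in 0..65535")
        self.threshold = threshold
        self.window = window
        self.errors = deque()    # timestamps (seconds) of recent errors
        self.isolated = False

    def record_error(self, now):
        """Log one error at time `now` and return the isolated state."""
        # Clear log entries that are older than the time window.
        while self.errors and now - self.errors[0] > self.window:
            self.errors.popleft()
        self.errors.append(now)
        # Isolate once the count inside the window exceeds the threshold.
        # Assumption: once set, the state stays set (no reset is modelled).
        if len(self.errors) > self.threshold:
            self.isolated = True
        return self.isolated
```

With a threshold of 2, two errors within the hour leave the device usable and a third one isolates it; errors spaced more than an hour apart never accumulate.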