Ping?

On Wed, Oct 30, 2019 at 2:41 PM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>
> Clarify some areas, clean up formatting, add section for
> unrecoverable error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> ---
>  Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>  2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>  AMDGPU RAS Support
>  ==================
>
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
> +debugfs (for error injection).
> +
>  RAS debugfs/sysfs Control and Error Injection Interfaces
>  --------------------------------------------------------
>
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :doc: AMDGPU RAS debugfs control interface
>
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>  RAS Error Count sysfs Interface
>  -------------------------------
>
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :internal:
>
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
> +are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for each supported IP block as well as
> +the error counts.
> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.
> +
>
>  GPU Power/Thermal Controls and Monitoring
>  =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * As their names indicate, inject operation will write the
>   * value to the address.
>   *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>   * It has three kinds of operations.
>   *
>   * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * - 2: inject errors on the block. Take ::inject as its data.
>   *
>   * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if in your codes and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>   *
>   * .. code-block:: bash
>   *
>   *     echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>   *
> + * Parameters:
> + *
>   * op: disable, enable, inject
>   *     disable: only block is needed
>   *     enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * .. note::
> - *     Operation is only allowed on blocks which are supported.
> + *     Operations are only allowed on blocks which are supported.
>   *     Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *     to see which blocks support RAS on a particular asic.
> + *
>   */
>  static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>                 size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>   * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>   *
>   * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram. This interface provides
> + * bad pages which experiences ECC errors in vram. This interface provides
>   * a way to reset the EEPROM, e.g., after testing error injection.
>   *
>   * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>  /**
>   * DOC: AMDGPU RAS sysfs Error Count Interface
>   *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>  }
>  /* sysfs end */
>
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover. However, in the event of an unrecoverable error,
> + * the driver provides an interface to reboot the system automatically
> + * in that event.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *     echo true > .../ras/auto_reboot
> + *
> + */
>  /* debugfs begin */
>  static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>  {
> --
> 2.23.0
>
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
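
For anyone trying the shell form of the control interface quoted in the patch ("echo op block [error [sub_block address value]] > .../ras/ras_ctrl"), the following is a minimal sketch of a wrapper that checks the argument count per op before anything is written. The function name ras_ctrl_cmd is hypothetical and not part of the driver or libdrm. The rule that inject takes all five fields is an assumption read off the bracketed usage string, and the "gfx" and "ue" values are taken only from the names that appear in the quoted documentation; whether a given block or error type is accepted depends on the asic.

```shell
#!/usr/bin/env bash
# Sketch only: builds the text for the RAS control node in the
# "op block [error [sub_block address value]]" form quoted in the patch.
# ras_ctrl_cmd is a hypothetical helper name, not a driver interface.
ras_ctrl_cmd() {
    local op=${1-}
    shift
    case "$op" in
        disable) [ "$#" -eq 1 ] || return 1 ;;  # only block is needed
        enable)  [ "$#" -eq 2 ] || return 1 ;;  # block and error are needed
        inject)  [ "$#" -eq 5 ] || return 1 ;;  # assumption: all five fields
        *)       return 1 ;;                    # unknown op
    esac
    # Emit the op followed by the remaining fields, space separated.
    printf '%s' "$op"
    printf ' %s' "$@"
    printf '\n'
}
```

A caller would redirect the output to the control node, for example `ras_ctrl_cmd disable gfx > "$RAS_CTRL"`, where RAS_CTRL is a variable the caller sets to the ras_ctrl path on their own system; the sketch deliberately does not guess that path.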