TDR and VRAM lost handling in KMD:

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Christian & Nicolai,

We need to achieve some agreements on what should MESA/UMD do and what should KMD do, please give your comments with â??okayâ?? or â??Noâ?? and your idea on below items,


l  When a job timed out (set from lockup_timeout kernel parameter), What KMD should do in TDR routine :


1.        Update adev->gpu_reset_counter, and stop scheduler first, (gpu_reset_counter is used to force vm flush after GPU reset, out of this threadâ??s scope so no more discussion on it)

2.        Set its fence error status to â??ETIMEâ??,

3.        Find the entity/ctx behind this job, and set this ctx as â??guiltyâ??

4.        Kick out this job from schedulerâ??s mirror list, so this job wonâ??t get re-scheduled to ring anymore.

5.        Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set all their fence status to â??ECANCELEDâ??

6.       Force signal all fences that get kicked out by above two steps, otherwise UMD will block forever if waiting on those fences

7.        Do gpu reset, which is can be some callbacks to let bare-metal and SR-IOV implement with their favor style

8.        After reset, KMD need to aware if the VRAM lost happens or not, bare-metal can implement some function to judge, while for SR-IOV I prefer to read it from GIM side (for initial version we consider itâ??s always VRAM lost, till GIM side change aligned)

9.        If VRAM lost not hit, continue, otherwise:

a)       Update adev->vram_lost_counter,

b)       Iterate over all living ctx, and set all ctx as â??guiltyâ?? since VRAM lost actually ruins all VRAM contents

c)        Kick out all jobs in all ctxâ??s KFIFO queue, and set all their fence status to â??ECANCELDEDâ??

10.     Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it ???)

11.     Re-schedule all JOBs remains in mirror list to ring again and restart scheduler (for VRAM lost case, no JOB will re-scheduled)


l  For cs_wait() IOCTL:
After it found fence signaled, it should check with â??dma_fence_get_statusâ?? to see if there is error there,
And return the error status of fence


l  For cs_wait_fences() IOCTL:
Similar with above approach


l  For cs_submit() IOCTL:
It need to check if current ctx been marked as â??guiltyâ?? and return â??ECANCELEDâ?? if so


l  Introduce a new IOCTL to let UMD query vram_lost_counter:
This way, UMD can also block app from submitting, like @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerate physical device, and deny all
gl-context from submitting if the counter queried bigger than that one cached in physical device. (looks a little overkill to me, but easy to implement )
UMD can also return error to APP when creating gl-context if found current queried vram_lost_counter bigger than that one cached in physical device.

BTW: I realized that gl-context is a little different with kernelâ??s context. Because for kernel. BO is not related with context but only with FD, while in UMD, BO have a backend
gl-context, so block submitting in UMD layer is also needed although KMD will do its job as bottom line


l  Basically â??vram_lost_counterâ?? is exposure by kernel to let UMD take the control of robust extension feature, it will be UMDâ??s call to move, KMD only deny â??guiltyâ?? context from submitting


Need your feedback, thx

Weâ??d better make TDR feature landed ASAP

BR Monk




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171011/1b78bd97/attachment-0001.html>


[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux