> [ML] I think context is better than entity, because for example if you
> only block entity_0 of a context and allow entity_N to run, that means
> the dependency between entities is broken (e.g. page table updates in
> the SDMA entity pass but the gfx submit in the GFX entity is blocked,
> which doesn't make sense to me).
>
> We'd better either block the whole context or let not…

Page table updates are not part of any context. So I think the only
thing we can do is to mark the entity as not scheduled any more.

> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
> their fence status to "*ECANCELED*"
>
> Setting ECANCELED should be ok. But I think we should do this when we
> try to run the jobs and not during GPU reset.
>
> [ML] Without deep thought and experiment, I'm not sure about the
> difference between them, but kicking them out in the gpu_reset routine
> is more efficient.

I really don't think so. Kicking them out during gpu_reset sounds racy
to me once more. And marking them canceled when we try to run them has
the clear advantage that all dependencies are met first.

> ML: KMD marks all contexts as guilty because that way we can unify our
> IOCTL behavior: e.g. the IOCTL only blocks a "guilty" context, no need
> to worry about the vram-lost-counter anymore; that's an implementation
> style. I don't think it is related to the UMD layer.

I don't think that this is a good idea. Instead, when you want to unify
the behavior we should use the vram_lost_counter as marker for the
guilty context.

Regards,
Christian.

On 11.10.2017 at 10:48, Liu, Monk wrote:
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
> it's reasonable to use it. However, it /does not/ make sense to mark
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
> perfect example where the driver should report context lost to
> applications with the "innocent" flag for contexts that were idle at
> the time of reset. The only context(s) that should be reported as
> "guilty" (or perhaps "unknown" in some cases) are the ones that were
> executing at the time of reset.
>
> ML: KMD marks all contexts as guilty because that way we can unify our
> IOCTL behavior: e.g. the IOCTL only blocks a "guilty" context, no need
> to worry about the vram-lost-counter anymore; that's an implementation
> style. I don't think it is related to the UMD layer.
>
> For UMD, the gl-context isn't known to KMD, so UMD can implement its
> own "guilty" gl-context handling if you want.
>
> If KMD doesn't mark all ctx as guilty after VRAM lost, can you
> illustrate what rule KMD should obey to check in a KMS IOCTL like
> cs_submit? Let's see which way is better.
>
> *From:* Haehnle, Nicolai
> *Sent:* Wednesday, October 11, 2017 4:41 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>;
> Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>;
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley
> <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
> Filipas, Mario <Mario.Filipas at amd.com>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> From a Mesa perspective, this almost all sounds reasonable to me.
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
> it's reasonable to use it. However, it /does not/ make sense to mark
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
> perfect example where the driver should report context lost to
> applications with the "innocent" flag for contexts that were idle at
> the time of reset. The only context(s) that should be reported as
> "guilty" (or perhaps "unknown" in some cases) are the ones that were
> executing at the time of reset.
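
(As an aside, the amdgpu UAPI already reserves distinct reset statuses
for exactly this three-way split. The helper below is only an
illustrative sketch of the mapping, not the driver's actual code, and
its two boolean inputs are assumptions.)

    #include <drm/amdgpu_drm.h>  /* AMDGPU_CTX_*_RESET status codes */

    /* Sketch: map what the kernel knows at reset time onto the reset
     * statuses the amdgpu UAPI defines for the query-state IOCTL. */
    static unsigned int sketch_reset_status(bool caused_reset, bool was_executing)
    {
            if (caused_reset)
                    return AMDGPU_CTX_GUILTY_RESET;   /* this ctx hung the GPU */
            if (was_executing)
                    return AMDGPU_CTX_UNKNOWN_RESET;  /* running, cause unclear */
            return AMDGPU_CTX_INNOCENT_RESET;         /* idle, e.g. VRAM lost only */
    }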
> On whether the whole context is marked as guilty from a user space
> perspective, it would simply be nice for user space to get consistent
> answers. It would be a bit odd if we could e.g. succeed in submitting
> an SDMA job after a GFX job was rejected. This would point in favor of
> marking the entire context as guilty (although that could happen
> lazily instead of at reset time). On the other hand, if that's too big
> a burden for the kernel implementation I'm sure we can live without it.
>
> Cheers,
>
> Nicolai
>
> ------------------------------------------------------------------------
>
> *From:* Liu, Monk
> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
> Alexander
> *Cc:* amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
> *Subject:* RE: TDR and VRAM lost handling in KMD:
>
> 2. Set its fence error status to "*ETIME*"
>
> No, as I already explained, ETIME is for synchronous operation.
>
> In other words, when we return ETIME from the wait IOCTL it would mean
> that the waiting has somehow timed out, but not the job we waited for.
>
> Please use ECANCELED as well, or some other error code when we find
> that we need to distinguish the timed-out job from the canceled ones
> (probably a good idea, but I'm not sure).
>
> [ML] I'm okay if you insist not to use ETIME.
>
> 3. Find the entity/ctx behind this job, and set this ctx as "*guilty*"
>
> Not sure. Do we want to set the whole context as guilty or just the
> entity?
>
> Setting the whole context as guilty sounds racy to me.
>
> BTW: We should use a different name than "guilty", maybe just "bool
> canceled;"?
>
> [ML] I think context is better than entity, because for example if you
> only block entity_0 of a context and allow entity_N to run, that means
> the dependency between entities is broken (e.g. page table updates in
> the SDMA entity pass but the gfx submit in the GFX entity is blocked,
> which doesn't make sense to me).
>
> We'd better either block the whole context or let not…
>
> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
> their fence status to "*ECANCELED*"
>
> Setting ECANCELED should be ok. But I think we should do this when we
> try to run the jobs and not during GPU reset.
>
> [ML] Without deep thought and experiment, I'm not sure about the
> difference between them, but kicking them out in the gpu_reset routine
> is more efficient.
>
> Otherwise you need to check the context/entity guilty flag in the
> run_job routine… and you need to do it for every context/entity. I
> don't see why we don't just kick all of them out in the gpu_reset
> stage….
>
> b) Iterate over all living ctx, and set all ctx as "*guilty*" since
> VRAM lost actually ruins all VRAM contents
>
> No, that shouldn't be done; compare the counters instead. Iterating
> over all contexts is way too much overhead.
>
> [ML] Because I want to keep the KMS IOCTL rules clean: they don't need
> to differentiate whether VRAM was lost or not, they are only
> interested in whether the context is guilty or not, and block submit
> for guilty ones.
>
> *Can you give more details of your idea? Preferably the detailed
> implementation in cs_submit; I want to see how you want to block
> submit without checking the context guilty flag.*
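
(For concreteness, the counter-based scheme Christian is referring to
could look roughly like the sketch below. This is only an illustration
of the idea, not the actual patch; the fields ctx->vram_lost_counter
and adev->vram_lost_counter are assumptions here.)

    #include <linux/atomic.h>
    #include <linux/errno.h>

    /* Sketch: lazy guilt check in the cs_submit path. Each context
     * samples vram_lost_counter at creation; a mismatch at submit time
     * means VRAM was lost since then, so the submit is rejected. */
    static int sketch_cs_check_ctx(struct amdgpu_device *adev,
                                   struct amdgpu_ctx *ctx)
    {
            if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
                    return -ECANCELED;      /* context predates the VRAM loss */

            return 0;
    }

With this shape, creating a context samples the counter once and no
iteration over all living contexts is ever needed at reset time.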
> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
> fence status to "*ECANCELED*"
>
> Yes and no, that should be done when we try to run the jobs and not
> during GPU reset.
>
> [ML] Again, kicking them out in the gpu reset routine is highly
> efficient; otherwise you need to check every job in run_job().
>
> Besides, can you illustrate the detailed implementation?
>
> Yes and no, dma_fence_get_status() is some specific handling for
> sync_file debugging (no idea why that made it into the common fence
> code).
>
> It was replaced by putting the error code directly into the fence, so
> just reading that one after waiting should be ok.
>
> Maybe we should fix dma_fence_get_status() to do the right thing for
> this?
>
> [ML] Yeah, that's too confusing; the name really sounds like the one I
> want to use. We should change it…
>
> *But looking into the implementation, I don't see why we cannot use
> it? It also finally returns the fence->error.*
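
(To make the fence-error mechanism being discussed concrete:
dma_fence_set_error() stores the code in fence->error before the fence
is signaled, and dma_fence_get_status() reports it afterwards. A
minimal sketch; the helper name is made up.)

    #include <linux/dma-fence.h>
    #include <linux/errno.h>

    /* Sketch: cancel a job by storing -ECANCELED in its fence before
     * signaling it, so waiters can observe the error afterwards. */
    static void sketch_cancel_fence(struct dma_fence *fence)
    {
            dma_fence_set_error(fence, -ECANCELED);
            dma_fence_signal(fence);

            /* dma_fence_get_status() now returns -ECANCELED: it reports
             * fence->error once the fence is signaled, 1 for plain
             * success, and 0 while the fence is still pending. */
    }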
> *From:* Koenig, Christian
> *Sent:* Wednesday, October 11, 2017 3:21 PM
> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Haehnle,
> Nicolai <Nicolai.Haehnle at amd.com <mailto:Nicolai.Haehnle at amd.com>>;
> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>;
> Deucher, Alexander <Alexander.Deucher at amd.com
> <mailto:Alexander.Deucher at amd.com>>
> *Cc:* amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel
> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry (SW)
> <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, Bingley
> <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Ramirez, Alejandro
> <Alejandro.Ramirez at amd.com <mailto:Alejandro.Ramirez at amd.com>>;
> Filipas, Mario <Mario.Filipas at amd.com <mailto:Mario.Filipas at amd.com>>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> See inline:
>
> On 11.10.2017 at 07:33, Liu, Monk wrote:
>
> Hi Christian & Nicolai,
>
> We need to reach some agreement on what MESA/UMD should do and what
> KMD should do. *Please give your comments with "okay" or "no" and
> your ideas on the items below.*
>
> - When a job times out (set from the lockup_timeout kernel parameter),
> what KMD should do in the TDR routine:
>
> 1. Update adev->*gpu_reset_counter*, and stop the scheduler first
> (*gpu_reset_counter* is used to force a vm flush after GPU reset;
> out of this thread's scope, so no more discussion on it)
>
> Okay.
>
> 2. Set its fence error status to "*ETIME*"
>
> No, as I already explained, ETIME is for synchronous operation.
>
> In other words, when we return ETIME from the wait IOCTL it would mean
> that the waiting has somehow timed out, but not the job we waited for.
>
> Please use ECANCELED as well, or some other error code when we find
> that we need to distinguish the timed-out job from the canceled ones
> (probably a good idea, but I'm not sure).
>
> 3. Find the entity/ctx behind this job, and set this ctx as "*guilty*"
>
> Not sure. Do we want to set the whole context as guilty or just the
> entity?
>
> Setting the whole context as guilty sounds racy to me.
>
> BTW: We should use a different name than "guilty", maybe just "bool
> canceled;"?
>
> 4. Kick out this job from the scheduler's mirror list, so this job
> won't get re-scheduled to the ring anymore.
>
> Okay.
>
> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
> their fence status to "*ECANCELED*"
>
> Setting ECANCELED should be ok. But I think we should do this when we
> try to run the jobs and not during GPU reset.
>
> 6. Force-signal all fences that got kicked out by the above two
> steps, *otherwise UMD will block forever if waiting on those fences*
>
> Okay.
>
> 7. Do the gpu reset, which can be some callbacks so bare-metal and
> SR-IOV can each implement it in their favored style
>
> Okay.
>
> 8. After reset, KMD needs to know whether VRAM lost happened or not;
> bare-metal can implement some function to judge, while for SR-IOV I
> prefer to read it from the GIM side (for the initial version we
> consider it always VRAM lost, till the GIM side change is aligned)
>
> Okay.
>
> 9. If VRAM lost is not hit, continue; otherwise:
>
> a) Update adev->*vram_lost_counter*
>
> Okay.
>
> b) Iterate over all living ctx, and set all ctx as "*guilty*" since
> VRAM lost actually ruins all VRAM contents
>
> No, that shouldn't be done; compare the counters instead. Iterating
> over all contexts is way too much overhead.
>
> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
> fence status to "*ECANCELED*"
>
> Yes and no, that should be done when we try to run the jobs and not
> during GPU reset.
>
> 10. Do GTT recovery and VRAM page tables/entries recovery
> (optional, do we need it???)
>
> Yes, that is still needed. As Nicolai explained, we can't be sure that
> VRAM is still 100% correct even when it isn't cleared.
>
> 11. Re-schedule all jobs remaining in the mirror list to the ring
> again and restart the scheduler (for the VRAM lost case, no job will
> be re-scheduled)
>
> Okay.
>
> - For the cs_wait() IOCTL:
>
> After it finds the fence signaled, it should check with
> *"dma_fence_get_status"* to see if there is an error there,
> and return the error status of the fence.
>
> Yes and no, dma_fence_get_status() is some specific handling for
> sync_file debugging (no idea why that made it into the common fence
> code).
>
> It was replaced by putting the error code directly into the fence, so
> just reading that one after waiting should be ok.
>
> Maybe we should fix dma_fence_get_status() to do the right thing for
> this?
>
> - For the cs_wait_fences() IOCTL:
>
> Similar to the above approach.
>
> - For the cs_submit() IOCTL:
>
> It needs to check whether the current ctx has been marked as
> "*guilty*" and return "*ECANCELED*" if so.
>
> - Introduce a new IOCTL to let UMD query the *vram_lost_counter*:
>
> This way, UMD can also block the app from submitting. Like @Nicolai
> mentioned, we can cache one copy of the *vram_lost_counter* when
> enumerating the physical device, and deny all gl-contexts from
> submitting if the queried counter is bigger than the one cached in
> the physical device. (Looks a little overkill to me, but easy to
> implement.)
>
> UMD can also return an error to the app when creating a gl-context if
> the currently queried *vram_lost_counter* is bigger than the one
> cached in the physical device.
>
> Okay. I already have a patch for this, please review that one if you
> haven't already done so.
>
> Regards,
> Christian.
>
> BTW: I realized that a gl-context is a little different from the
> kernel's context, because for the kernel a BO is not related to a
> context but only to an FD, while in UMD a BO has a backing
> gl-context. So blocking submission in the UMD layer is also needed,
> although KMD will do its job as the bottom line.
>
> - Basically, "vram_lost_counter" is exposed by the kernel to let UMD
> take control of the robustness extension feature. It will be UMD's
> call to make; KMD only denies "guilty" contexts from submitting.
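
(A rough sketch of the run-time alternative Christian argues for above:
marking jobs canceled when the scheduler tries to run them instead of
purging queues in gpu_reset. The types, the "canceled" flag, and
sketch_hw_submit() are assumptions for illustration, not the
scheduler's real interface.)

    #include <linux/dma-fence.h>
    #include <linux/errno.h>

    struct sketch_entity {
            bool canceled;          /* set once for the guilty entity/ctx */
    };

    struct sketch_job {
            struct sketch_entity *entity;
            struct dma_fence *finished;     /* fence UMD waits on */
    };

    static struct dma_fence *sketch_hw_submit(struct sketch_job *job);

    /* Sketch: run_job path. By the time the scheduler calls this, all
     * dependencies of the job have signaled, so canceling here cannot
     * break inter-entity ordering. */
    static struct dma_fence *sketch_run_job(struct sketch_job *job)
    {
            if (job->entity->canceled) {
                    dma_fence_set_error(job->finished, -ECANCELED);
                    dma_fence_signal(job->finished);
                    return NULL;            /* job never reaches the ring */
            }

            return sketch_hw_submit(job);   /* normal path: push to the ring */
    }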
>
> Need your feedback, thx.
>
> We'd better make the TDR feature land ASAP.
>
> BR Monk