TDR and VRAM lost handling in KMD:

Monk.Liu@xxxxxxx (Liu, Monk) · Wed, 11 Oct 2017 09:27:00 +0000

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTL only blocks a "guilty" context, with no need to worry about vram-lost-counter anymore. That's an implementation style; I don't think it is related to the UMD layer.
I don't think that this is a good idea. Instead, if you want to unify the behavior, we should use the vram_lost_counter as the marker for the guilty context.

[ML] Say that we only block at the entity level; then we have two rules:

1) We block submission for a "guilty" entity in the run_job routine (and mark the entity as guilty in gpu_reset).

2) For an innocent entity, we still need to check vram_lost_counter in cs_submit, correct?
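
A minimal sketch of the two rules above, written as standalone C rather than driver code. Every name here (`sketch_device`, `sketch_ctx`, `sketch_entity`, `sketch_gpu_reset`, `sketch_run_job`, `sketch_cs_submit`) is hypothetical, and returning `-ECANCELED` from both checks is an assumption for illustration, not something the thread settles.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; they are not the amdgpu driver's types. */
struct sketch_device {
    unsigned int vram_lost_counter;   /* bumped each time a reset loses VRAM */
};

struct sketch_ctx {
    unsigned int vram_lost_snapshot;  /* device counter when the ctx was created */
};

struct sketch_entity {
    struct sketch_ctx *ctx;
    bool guilty;                      /* set by the reset path for the hung entity */
};

/* Reset path: mark the entity whose job hung and record VRAM loss, if any. */
static void sketch_gpu_reset(struct sketch_device *dev,
                             struct sketch_entity *hung, bool vram_lost)
{
    if (hung)
        hung->guilty = true;
    if (vram_lost)
        dev->vram_lost_counter++;
}

/* Rule 1: refuse to run jobs that belong to a guilty entity. */
static int sketch_run_job(const struct sketch_entity *entity)
{
    return entity->guilty ? -ECANCELED : 0;
}

/*
 * Rule 2: an innocent entity is still rejected at submit time if VRAM
 * was lost after its context took the counter snapshot.
 */
static int sketch_cs_submit(const struct sketch_device *dev,
                            const struct sketch_entity *entity)
{
    if (entity->ctx->vram_lost_snapshot != dev->vram_lost_counter)
        return -ECANCELED;
    return 0;
}
```

In this sketch a reset without VRAM loss only affects the hung entity, while a reset with VRAM loss invalidates every context created before it, guilty or not.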

Besides, Nicolai reminded me that we have amdgpu_ctx_query() to worry about:
when we mark some entity as "guilty", do we need to mark the context behind it as "AMDGPU_CTX_GUILTY_RESET"?

I hadn't thought of this ... I just ignored it ...

BR Monk
From: Koenig, Christian
Sent: Wednesday, October 11, 2017 5:03 PM
To: Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

[ML] I think context is better than entity because, for example, if you only block entity_0 of a context and allow entity_N to run, the dependency between entities is broken (e.g. page table updates in the SDMA entity pass but the gfx submit in the GFX entity is blocked, which doesn't make sense to me).
We'd better either block the whole context or not block at all ...
Page table updates are not part of any context.

So I think the only thing we can do is to mark the entity as not scheduled any more.

1. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence status to "ECANCELED"
Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

[ML] Without deeper thought and experiment, I'm not sure of the difference between them, but kicking them out in the gpu_reset routine is more efficient.
I really don't think so. Kicking them out during gpu_reset sounds racy to me once more.

And marking them canceled when we try to run them has the clear advantage that all dependencies are met first.
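
To illustrate the ordering argument, here is a small sketch in which a job from a guilty entity is canceled only at the point where it would otherwise run, that is, after its dependency has signaled. The types and the function `sketch_try_run_job` are hypothetical stand-ins, not the scheduler's actual structures.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the GPU scheduler's structures. */
struct sketch_fence {
    bool signaled;
    int error;                        /* 0 or a negative errno such as -ECANCELED */
};

struct sketch_job {
    struct sketch_fence *dependency;  /* may be NULL */
    struct sketch_fence done;         /* signaled when the job finishes or is canceled */
    bool entity_guilty;
    bool ran_on_hw;
};

/*
 * Try to run one job. Returns false if the job must keep waiting.
 * The guilty check happens only after the dependency has signaled,
 * so a canceled job never completes ahead of what it depends on.
 */
static bool sketch_try_run_job(struct sketch_job *job)
{
    if (job->dependency && !job->dependency->signaled)
        return false;

    if (job->entity_guilty)
        job->done.error = -ECANCELED; /* skip the hardware, report cancellation */
    else
        job->ran_on_hw = true;        /* stand-in for real submission */

    job->done.signaled = true;
    return true;
}
```

Canceling inside the reset routine instead would signal `done` while the dependency could still be pending, which is the ordering problem this approach avoids.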

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTL only blocks a "guilty" context, with no need to worry about vram-lost-counter anymore. That's an implementation style; I don't think it is related to the UMD layer.
I don't think that this is a good idea. Instead, if you want to unify the behavior, we should use the vram_lost_counter as the marker for the guilty context.

Regards,
Christian.

On 11.10.2017 at 10:48, Liu, Monk wrote:

On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.
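
The reporting policy described above could be sketched roughly as follows. The enum and the function `sketch_query_reset_status` are hypothetical and only mirror the guilty/innocent distinction in the text; the "unknown" case is not modeled, and reporting no reset for an idle context whose VRAM contents survived is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical status values, named after the distinction drawn in the text. */
enum sketch_reset_status {
    SKETCH_NO_RESET,
    SKETCH_GUILTY_RESET,
    SKETCH_INNOCENT_RESET,
};

/*
 * Decide what a context should report to the application.
 * - executing at the time of reset  -> guilty
 * - idle at reset, but VRAM lost    -> context lost, reported as innocent
 * - idle at reset, VRAM intact      -> nothing to report (assumption)
 */
static enum sketch_reset_status
sketch_query_reset_status(bool reset_since_creation,
                          bool executing_at_reset,
                          bool vram_lost)
{
    if (!reset_since_creation)
        return SKETCH_NO_RESET;
    if (executing_at_reset)
        return SKETCH_GUILTY_RESET;
    if (vram_lost)
        return SKETCH_INNOCENT_RESET;
    return SKETCH_NO_RESET;
}
```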

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTL only blocks a "guilty" context, with no need to worry about vram-lost-counter anymore. That's an implementation style; I don't think it is related to the UMD layer.
For UMD, KMD isn't aware of the gl-context, so UMD can implement its own "guilty" gl-context if you want.

If KMD doesn't mark all contexts as guilty after VRAM is lost, can you illustrate what rule KMD should follow for checks in a KMS IOCTL like cs_submit? Let's see which way is better.

From: Haehnle, Nicolai
Sent: Wednesday, October 11, 2017 4:41 PM
To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD: