deprecated register issues

ckoenig.leichtzumerken@xxxxxxxxx (Christian König) · Thu, 8 Mar 2018 10:40:36 +0100

Hi Monk,

> We can avoid this VM flush failure by stitching together the REQ send
> and the ACK read; at least for the compute ring this is confirmed.
Well, there are two misunderstandings here.

First of all, this solution doesn't really work. It just hides the
problem because we no longer do a world switch in between those two
packets. And while we could change the SDMA, UVD and VCE firmware to do
something similar, you can't apply this solution to CPU-based flushes.

The second issue is that this isn't related to VMHUB flushing at all, 
it's just that VMHUB flushing is the first thing where you notice that 
something is wrong.

The real problem is that when you access CC_RB_BACKEND_DISABLE and a
bunch of other registers, the bus on Vega10 sometimes gets a hiccup and
drops other reads and writes.

So we need to identify those registers and remove all accesses to them,
otherwise working with the hardware will just be horribly unreliable in
general.
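
A minimal sketch of what "remove all accesses" can look like for the runtime path, along the lines of the read-once-and-cache approach Alex describes further down the thread. The register array, the fake_rreg32() helper, the struct and the field mask/shift values are stand-ins invented for this illustration, not the real driver code or headers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real register offsets and field layout. */
enum { REG_CC_RB_BACKEND_DISABLE, REG_GC_USER_RB_BACKEND_DISABLE, REG_COUNT };
#define BACKEND_DISABLE_MASK  0x00FF0000u   /* assumed field mask */
#define BACKEND_DISABLE_SHIFT 16            /* assumed field shift */

/* Simulated register file: one RB (bit 1) is marked disabled. */
static uint32_t fake_regs[REG_COUNT] = { 0x00020000u, 0x00000000u };
static unsigned mmio_reads;                 /* counts simulated MMIO reads */

static uint32_t fake_rreg32(int reg)
{
    mmio_reads++;
    return fake_regs[reg];
}

struct gfx_config {
    uint32_t rb_active_bitmap;              /* filled in once at init */
};

/* Returns a mask with the low `bits` bits set. */
static uint32_t create_bitmask(unsigned bits)
{
    assert(bits <= 32);
    return (uint32_t)((1ULL << bits) - 1);
}

/* Touch the problematic registers exactly once, at startup. */
static void gfx_init(struct gfx_config *cfg, unsigned rb_per_sh)
{
    uint32_t data = fake_rreg32(REG_CC_RB_BACKEND_DISABLE);

    data |= fake_rreg32(REG_GC_USER_RB_BACKEND_DISABLE);
    data &= BACKEND_DISABLE_MASK;
    data >>= BACKEND_DISABLE_SHIFT;

    cfg->rb_active_bitmap = ~data & create_bitmask(rb_per_sh);
}

/* Runtime queries (e.g. from a UMD) only ever see the cached value. */
static uint32_t query_rb_active_bitmap(const struct gfx_config *cfg)
{
    return cfg->rb_active_bitmap;
}
```

With this shape, calling gfx_init() once and then query_rb_active_bitmap() any number of times leaves the read counter at two, so nothing hits the register while other work is in flight. It does not help with accesses made by firmware, such as the save/restore list.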

Regards,
Christian.

On 08.03.2018 at 04:05, Liu, Monk wrote:
>
> Hi Alex
>
> We can avoid this VM flush failure by stitching together the REQ send
> and the ACK read; at least for the compute ring this is confirmed.
>
> And I believe the same could be achieved for the SDMA ring (and even
> the UVD/VCE rings).
>
> But @Koenig, Christian <mailto:Christian.Koenig at amd.com> insists that
> stitching together the REQ and ACK parts is not a proper fix for the
> issue, just a workaround, and I cannot argue with that.
>
> What worries me more is that, as Alex said, there may be more registers
> that behave like CC_RB_BACKEND_DISABLE.
>
> Since we don't know their names (too hard to filter them out!), we
> can't remove them all from the SR list.
>
> So I still think we need a plan B to handle the above case, i.e. use one
> packet for the REQ and ACK job.
>
> /Monk
>
> *From:*Deucher, Alexander
> *Sent:* March 8, 2018 10:53
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; Mao, David <David.Mao at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org; Jin, Jian-Rong 
> <Jian-Rong.Jin at amd.com>
> *Subject:* Re: deprecated register issues
>
> I think there are more registers than just CC_RB_BACKEND_DISABLE that
> could cause this problem. IIRC, an entire class of gfx registers could
> cause it; it just happened that this was one of the only ones we read
> back via MMIO. Also, for the save and restore list, I think the RLC
> uses a different interface to read back the registers, so it may not be
> affected the same way.
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:*Liu, Monk
> *Sent:* Wednesday, March 7, 2018 9:42:41 PM
> *To:* Deucher, Alexander; Koenig, Christian; Mao, David
> *Cc:* amd-gfx at lists.freedesktop.org; Jin, Jian-Rong
> *Subject:* RE: deprecated register issues
>
> Hi guys
>
> According to Christian's findings, reading this register can make the VM
> hub fail to finish a VM flush request. For example, SDMA does a VM flush
> by first writing to vm_invalidate_req and then reading the result from
> vm_invalidate_ack, but the driver will never get the correct value from
> vm_invalidate_ack if BIF is reading CC_RB_BACKEND_DISABLE in the
> meantime.
>
> The SR-IOV world switch may run into similar trouble; see the
> save_restore_list below (during a world switch, RLCV saves the current
> VF's registers according to this list and restores all of them when
> loading this VF back):
>
> uint32 register_restore[] = {
>     (uint32)((0x3000 << 18) | mmPA_SC_FIFO_SIZE),   /* SC */
>     0x00000001,
>     (uint32)((0x3000 << 18) | mmCC_RB_BACKEND_DISABLE),   /* SC SC PER_SE */
>     0x00000000,
>     (uint32)((0x3400 << 18) | mmCC_RB_BACKEND_DISABLE),   /* SC SC PER_SE */
>     0x00000000,
>     (uint32)((0x3800 << 18) | mmCC_RB_BACKEND_DISABLE),   /* SC SC PER_SE */
>     0x00000000,
>     (uint32)((0x3c00 << 18) | mmCC_RB_BACKEND_DISABLE),   /* SC SC PER_SE */
>     0x00000000,
>     (uint32)((0x3000 << 18) | mmVGT_VTX_VECT_EJECT_REG),
>     0x00000001,
>     (uint32)((0x3000 << 18) | mmVGT_DMA_DATA_FIFO_DEPTH),   /* IA WD */
>     0x00000001,
>     (uint32)((0x3000 << 18) | mmVGT_DMA_REQ_FIFO_DEPTH),   /* WD */
>     0x00000001,
>     (uint32)((0x3000 << 18) | mmVGT_DRAW_INIT_FIFO_DEPTH),   /* WD */
>     0x00000001,
>     (uint32)((0x3000 << 18) | mmVGT_CACHE_INVALIDATION),   /* IA */
>     0x00000001,
>     (uint32)((0x3000 << 18) | mmVGT_RESET_DEBUG),   /* WD */
>     0x00000001,
>     (uint32)((0x3000 << 18) | mmVGT_FIFO_DEPTHS),
>
> I will run some tests against CC_RB_BACKEND_DISABLE to see whether the
> VM flush failure can be avoided by removing those four entries from the
> SR list.
>
> Thanks
>
> /Monk
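
The experiment proposed above (dropping the CC_RB_BACKEND_DISABLE entries from the save/restore list) could be sketched roughly as follows. This assumes the list is a flat array of (header, value) pairs and that the low 18 bits of each header hold the register offset, which is only what the `<< 18` in the listing suggests; the mask name and the helper are made up for illustration and are not the RLCV firmware's actual format or API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: bits 17:0 of a header word carry the register offset. */
#define SR_REG_OFFSET_MASK ((1u << 18) - 1)

/*
 * Compact the list in place, dropping every (header, value) pair whose
 * header targets `reg`. `len` is the list length in 32-bit words and
 * must be even. Returns the new length in words.
 */
static size_t sr_list_drop_reg(uint32_t *list, size_t len, uint32_t reg)
{
    size_t out = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i + 1 < len; i += 2) {
        if ((list[i] & SR_REG_OFFSET_MASK) == reg)
            continue;               /* skip header and its value */
        list[out++] = list[i];
        list[out++] = list[i + 1];
    }
    return out;
}
```

Matching on the offset bits alone removes all four per-SE entries in one pass, whatever the upper header bits encode. As Alex notes, the RLC may read these registers through a different interface, so whether this filtering changes anything is exactly what the test would have to show.
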
>
> *From:*Deucher, Alexander
> *Sent:* March 7, 2018 23:13
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Mao, David
> <David.Mao at amd.com>; Liu, Monk <Monk.Liu at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org; Jin, Jian-Rong
> <Jian-Rong.Jin at amd.com>
> *Subject:* Re: deprecated register issues
>
> Right. We ran into issues with reading back that register at runtime
> when UMDs queried it while other stuff was in flight, so we just read
> it once at startup and cache the result. Now when UMDs request it, we
> return the cached value.
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:*Koenig, Christian
> *Sent:* Wednesday, March 7, 2018 9:31:13 AM
> *To:* Mao, David; Liu, Monk
> *Cc:* Deucher, Alexander; amd-gfx at lists.freedesktop.org; Jin, Jian-Rong
> *Subject:* Re: deprecated register issues
>
> Hi David,
>
> well, I just figured out that this is a misunderstanding.
>
> Accessing this register and some other deprecated registers can cause
> problems when invalidating VMHUBs.
>
> This register itself isn't deprecated; the wording in a patch fixing
> things is just a bit unclear.
>
> The question is: is that register still accessed regularly, or is its
> value cached after startup?
>
> Regards,
> Christian.
>
> On 07.03.2018 at 15:25, Mao, David wrote:
>
>     We require the base driver to provide the mask of disabled RBs.
>
>     This is why the kernel reads CC_RB_BACKEND_DISABLE to collect the
>     harvest configuration.
>
>     Where did you learn that the register is deprecated?
>
>     I think it should still be there.
>
>     Best Regards,
>
>     David
>
>         On Mar 7, 2018, at 9:49 PM, Liu, Monk <Monk.Liu at amd.com> wrote:
>
>         + UMD guys
>
>         Hi David
>
>         Do you know if GC_USER_RB_BACKEND_DISABLE still exists for
>         gfx9/vega10?
>
>         We found that CC_RB_BACKEND_DISABLE was deprecated, but it looks
>         like it is still in use in the KMD, so I want to check both of
>         the above registers with you.
>
>         Thanks
>
>         /Monk
>
>         *From:* amd-gfx
>         [mailto:amd-gfx-bounces at lists.freedesktop.org] *On Behalf
>         Of* Christian König
>         *Sent:* March 7, 2018 20:26
>         *To:* Liu, Monk <Monk.Liu at amd.com>; Deucher, Alexander
>         <Alexander.Deucher at amd.com>
>         *Cc:* amd-gfx at lists.freedesktop.org
>         *Subject:* Re: deprecated register issues
>
>         Hi Monk,
>
>         I honestly don't have the slightest idea why we are still
>         accessing CC_RB_BACKEND_DISABLE. Maybe it still contains some
>         useful values?
>
>         Key point was that we needed to stop accessing it all the time
>         to avoid triggering problems.
>
>         Regards,
>         Christian.
>
>         On 07.03.2018 at 13:11, Liu, Monk wrote:
>
>             Hi Christian
>
>             I remember you and AlexD mentioned that a handful of
>             registers are deprecated for greenland (gfx9),
>
>             e.g. CC_RB_BACKEND_DISABLE.
>
>             Do you know why we still have this routine?
>
>             static u32
>             gfx_v9_0_get_rb_active_bitmap(struct amdgpu_device *adev)
>             {
>                 u32 data, mask;
>
>                 data = RREG32_SOC15(GC, 0, mmCC_RB_BACKEND_DISABLE);
>                 data |= RREG32_SOC15(GC, 0, mmGC_USER_RB_BACKEND_DISABLE);
>
>                 data &= CC_RB_BACKEND_DISABLE__BACKEND_DISABLE_MASK;
>                 data >>= GC_USER_RB_BACKEND_DISABLE__BACKEND_DISABLE__SHIFT;
>
>                 mask = amdgpu_gfx_create_bitmask(adev->gfx.config.max_backends_per_se /
>                                                  adev->gfx.config.max_sh_per_se);
>
>                 return (~data) & mask;
>             }
>
>             See that it still reads CC_RB_BACKEND_DISABLE.
>
>             thanks
>
>             /Monk
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
