I'm going to be away for a few days. Thank you very much for your detailed answer.
Best Regards.
Yanhua
------------------ Original Message ------------------
From: "Koenig, Christian"<Christian.Koenig@xxxxxxx>;
Date: Friday, September 6, 2019, 7:23 PM
To: "yanhua"<78666679@xxxxxx>; "amd-gfx"<amd-gfx@xxxxxxxxxxxxxxxxxxxxx>;
Cc: "Deucher, Alexander"<Alexander.Deucher@xxxxxxx>;
Subject: Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
Is there anything I have missed?
Yeah, unfortunately quite a bunch of things. The fact that arm64 doesn't support the PCIe NoSnoop TLP attribute is only the tip of the iceberg.
You need a full "recent" driver stack, e.g. not older than a few months to a year, for this to work. And not only the kernel, but also recent userspace components.
Maybe that's something you could try first, e.g. install a recent version of Mesa and/or tell Mesa not to use the SDMA at all. But since you are running into an SDMA lockup with a kernel-triggered page table update, I see little chance that this will work.
The only other alternative I can see is the DKMS package of the pro-driver. With that one you might be able to compile the recent driver for an older kernel version.
But I can't guarantee at all that this actually works on ARM64.
Sorry that I don't have better news for you,
Christian.
On 05.09.19 at 03:36, yanhua wrote:
Hi Christian, I noticed that you said 'amdgpu is known to not work on arm64 until very recently'. The CPU-related drm commit I found is "drm: disable uncached DMA optimization for ARM and arm64":
@@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
return false;
#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+ /*
+ * The DRM driver stack is designed to work with cache coherent devices
+ * only, but permits an optimization to be enabled in some cases, where
+ * for some buffers, both the CPU and the GPU use uncached mappings,
+ * removing the need for DMA snooping and allocation in the CPU caches.
+ *
+ * The use of uncached GPU mappings relies on the correct implementation
+ * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
+ * will use cached mappings nonetheless. On x86 platforms, this does not
+ * seem to matter, as uncached CPU mappings will snoop the caches in any
+ * case. However, on ARM and arm64, enabling this optimization on a
+ * platform where NoSnoop is ignored results in loss of coherency, which
+ * breaks correct operation of the device. Since we have no way of
+ * detecting whether NoSnoop works or not, just disable this
+ * optimization entirely for ARM and arm64.
+ */
+ return false;
#else
return true;
#endif
The real effect is in amdgpu_object.c:
if (!drm_arch_can_wc_memory())
bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
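As far as I can tell, the flag only matters when TTM picks the caching attributes for the GTT placement. A from-memory sketch of the 4.19-era amdgpu_bo_placement_from_domain() logic (not the exact code, just the idea):

/* Sketch, not the exact 4.19 code: with USWC cleared, the GTT
 * placement keeps a cached CPU mapping instead of write-combined. */
if (domain & AMDGPU_GEM_DOMAIN_GTT) {
	places[c].flags = TTM_PL_FLAG_TT;
	if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
		places[c].flags |= TTM_PL_FLAG_WC | TTM_PL_FLAG_UNCACHED;
	else
		places[c].flags |= TTM_PL_FLAG_CACHED;
	c++;
}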
And we have AMDGPU_GEM_CREATE_CPU_GTT_USWC turned off in our 4.19.36 kernel, so I think this is not the cause of my bug. Is there anything I have missed?
I suggested that the machine supplier move to a newer kernel such as 5.2.2, but they failed to get it working after several tries. We also backported a series of patches from newer kernels, but we still get the ring timeout.
We have been digging into the amdgpu drm driver for a long time, but it is really difficult for me, especially the hardware-related ring timeout.
------------------
Yanhua
------------------ Original Message ------------------
From: "Koenig, Christian"<Christian.Koenig@xxxxxxx>;
Date: Tuesday, September 3, 2019, 9:19 PM
Cc: "Deucher, Alexander"<Alexander.Deucher@xxxxxxx>;
Subject: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
This is just a GPU lockup. Please open up a bug report on freedesktop.org and attach the full dmesg output, noting which version of Mesa you are using.
Regards,
Christian.
On 03.09.19 at 15:16, 78666679 wrote:
Yes, with dmesg | grep drm, I get the following:
[348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
------------------ Original Message ------------------
From: "Koenig, Christian"<Christian.Koenig@xxxxxxx>;
Date: Tuesday, September 3, 2019, 9:07 PM
Cc: "Deucher, Alexander"<Alexander.Deucher@xxxxxxx>;
Subject: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
Well that looks like the hardware got stuck.
Do you get something in the logs about a timeout on the SDMA ring?
Regards,
Christian.
On 03.09.19 at 14:50, 78666679 wrote:
Hi Christian, sometimes a thread gets blocked in disk sleep in a call to amdgpu_sa_bo_new. The stack trace follows. It seems the SA BO space is used up, so the caller blocks waiting for someone to free SA resources (see the sketch after the trace).
D 206833 227656 [surfaceflinger] <defunct> Binder:45_5
cat /proc/206833/task/227656/stack
[<0>] __switch_to+0x94/0xe8
[<0>] dma_fence_wait_any_timeout+0x234/0x2d0
[<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
[<0>] amdgpu_ib_get+0x60/0xc8 [amdgpu]
[<0>] amdgpu_job_alloc_with_ib+0x70/0xb0 [amdgpu]
[<0>] amdgpu_vm_bo_update_mapping+0x2e0/0x3d8 [amdgpu]
[<0>] amdgpu_vm_bo_update+0x2a0/0x710 [amdgpu]
[<0>] amdgpu_gem_va_ioctl+0x46c/0x4c8 [amdgpu]
[<0>] drm_ioctl_kernel+0x94/0x118 [drm]
[<0>] drm_ioctl+0x1f0/0x438 [drm]
[<0>] amdgpu_drm_ioctl+0x58/0x90 [amdgpu]
[<0>] do_vfs_ioctl+0xc4/0x8c0
[<0>] ksys_ioctl+0x8c/0xa0
[<0>] __arm64_sys_ioctl+0x28/0x38
[<0>] el0_svc_common+0xa0/0x180
[<0>] el0_svc_handler+0x38/0x78
[<0>] el0_svc+0x8/0xc
[<0>] 0xffffffffffffffff
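If I read amdgpu_sa.c correctly, the wait is uninterruptible, which matches the D state above. A simplified sketch of the blocking path (not the real code; the fence array and count here are placeholders):

#include <linux/dma-fence.h>
#include <linux/sched.h>

/* Simplified sketch of the blocking path in amdgpu_sa_bo_new(). */
static int sa_wait_for_space(struct dma_fence **fences, uint32_t count)
{
	signed long t;

	/*
	 * No free suballocation left: block until any in-flight job
	 * that still holds one signals its fence.  intr=false makes
	 * the sleep uninterruptible, i.e. the D state in the trace.
	 */
	t = dma_fence_wait_any_timeout(fences, count, false,
				       MAX_SCHEDULE_TIMEOUT, NULL);
	return t < 0 ? t : 0;
}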
--------------------
YanHua
------------------ Original Message ------------------
From: "Koenig, Christian"<Christian.Koenig@xxxxxxx>;
Date: Tuesday, September 3, 2019, 4:21 PM
Cc: "Deucher, Alexander"<Alexander.Deucher@xxxxxxx>;
Subject: Re: Bug: amdgpu drm driver cause process into Disk sleep state
Hi Yanhua,
please update your kernel first, because that looks like a known issue
which was recently fixed by the patch "drm/scheduler: use job count instead
of peek".
Probably best to try the latest bleeding edge kernel and if that doesn't
help please open up a bug report on https://bugs.freedesktop.org/.
Regards,
Christian.
On 03.09.19 at 09:35, 78666679 wrote:
> Hi, Sirs:
> I have a WX5100 amdgpu card. It randomly runs into failures; sometimes it causes processes to get stuck in the uninterruptible wait state.
>
>
> cps-new-ondemand-0587:~ # ps aux|grep -w D
> root 11268 0.0 0.0 260628 3516 ? Ssl Aug26 0:00 /usr/sbin/gssproxy -D
> root 136482 0.0 0.0 212500 572 pts/0 S+ 15:25 0:00 grep --color=auto -w D
> root 370684 0.0 0.0 17972 7428 ? Ss Sep02 0:04 /usr/sbin/sshd -D
> 10066 432951 0.0 0.0 0 0 ? D Sep02 0:00 [FakeFinalizerDa]
> root 496774 0.0 0.0 0 0 ? D Sep02 0:17 [kworker/8:1+eve]
> cps-new-ondemand-0587:~ # cat /proc/496774/stack
> [<0>] __switch_to+0x94/0xe8
> [<0>] drm_sched_entity_flush+0xf8/0x248 [gpu_sched]
> [<0>] amdgpu_ctx_mgr_entity_flush+0xac/0x148 [amdgpu]
> [<0>] amdgpu_flush+0x2c/0x50 [amdgpu]
> [<0>] filp_close+0x40/0xa0
> [<0>] put_files_struct+0x118/0x120
> [<0>] put_files_struct+0x30/0x68 [binder_linux]
> [<0>] binder_deferred_func+0x4d4/0x658 [binder_linux]
> [<0>] process_one_work+0x1b4/0x3f8
> [<0>] worker_thread+0x54/0x470
> [<0>] kthread+0x134/0x138
> [<0>] ret_from_fork+0x10/0x18
> [<0>] 0xffffffffffffffff
>
>
>
> This issue has troubled me for a long time. I am eagerly looking forward to your help!
>
>
> -----
> Yanhua
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx