Hi Felix, Thanks for reply. I'll record a backtrace when I'm free. Besides, here is a dmesg log from someone else in the issue discussion about this problem: https://projects.blender.org/attachments/ea7b7db5-ac16-479d-935b-9e1da33cd6f0 Tested using next-20240129 with this patch applied, and setup is Plasma 6.0 RC1(Wayland) + RX 6600 XT. Best regards, Shengyu 在 2024/1/30 1:47, Felix Kuehling 写道: > On 2024-01-29 10:24, Shengyu Qu wrote: >> Hello Felix, >> I think you are right. This problem has existed for years(just look at >> the >> issue creation time in my link), and is thought caused by OpenGL-ROCMTe >> interop(that's why I think this patch might help). It is very easy to >> trigger this problem in blender(method is also mentioned in the link). > > This doesn't help you, but it's unlikely that this has been the same > issue for two years for everybody who chimed into this bug report. > Different kernel versions, GPUs, user mode ROCm and Mesa versions etc. > > Case in point, it's possible that you're seeing an issue specific to > RDNA3, which hasn't even been around for that long. > > >> Do >> you have any idea about this? > > Not without seeing a lot more diagnostic information. A full backtrace > from your kernel log would be a good start. > > Regards, > Felix > > >> Best regards, >> Shengyu >> 在 2024/1/29 22:51, Felix Kuehling 写道: >>> On 2024-01-29 8:58, Shengyu Qu wrote: >>>> Hi, >>>> Seems rocm-opengl interop hang problem still exists[1]. Btw have you >>>> discovered into this problem? >>>> Best regards, >>>> Shengyu >>>> [1] >>>> https://projects.blender.org/blender/blender/issues/100353#issuecomment-1111599 >>> >>> Maybe you're having a different problem. Do you see this issue also >>> without any version of the "Relocate TBA/TMA ..." patch? >>> >>> Regards, >>> Felix >>> >>> >>>> >>>> 在 2024/1/27 03:15, Shengyu Qu 写道: >>>>> Hello Felix, >>>>> This patch seems working on my system, also it seems fixes the >>>>> ROCM/OpenGL >>>>> interop problem. >>>>> Is this intended to happen or not? Maybe we need more users to test >>>>> it. >>>>> Besides, >>>>> Tested-by: Shengyu Qu <wiagn233@xxxxxxxxxxx> >>>>> Best Regards, >>>>> Shengyu >>>>> >>>>> 在 2024/1/26 06:27, Felix Kuehling 写道: >>>>>> The TBA and TMA, along with an unused IB allocation, reside at low >>>>>> addresses in the VM address space. A stray VM fault which hits these >>>>>> pages must be serviced by making their page table entries invalid. >>>>>> The scheduler depends upon these pages being resident and fails, >>>>>> preventing a debugger from inspecting the failure state. >>>>>> >>>>>> By relocating these pages above 47 bits in the VM address space they >>>>>> can only be reached when bits [63:48] are set to 1. This makes it >>>>>> much >>>>>> less likely for a misbehaving program to generate accesses to them. >>>>>> The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL >>>>>> access with a small offset. >>>>>> >>>>>> v2: >>>>>> - Move it to the reserved space to avoid concflicts with Mesa >>>>>> - Add macros to make reserved space management easier >>>>>> >>>>>> Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@xxxxxxx> >>>>>> Cc: Christian Koenig <christian.koenig@xxxxxxx> >>>>>> Signed-off-by: Jay Cornwall <jay.cornwall@xxxxxxx> >>>>>> Signed-off-by: Felix Kuehling <felix.kuehling@xxxxxxx> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 4 +-- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c | 7 ++--- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 12 ++++++-- >>>>>> drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 30 >>>>>> +++++++++++--------- >>>>>> 4 files changed, 30 insertions(+), 23 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c >>>>>> index 823d31f4a2a3..53d0a458d78e 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c >>>>>> @@ -28,9 +28,9 @@ >>>>>> uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev) >>>>>> { >>>>>> - uint64_t addr = adev->vm_manager.max_pfn << >>>>>> AMDGPU_GPU_PAGE_SHIFT; >>>>>> + uint64_t addr = AMDGPU_VA_RESERVED_CSA_START( >>>>>> + adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT); >>>>>> - addr -= AMDGPU_VA_RESERVED_CSA_SIZE; >>>>>> addr = amdgpu_gmc_sign_extend(addr); >>>>>> return addr; >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c >>>>>> index 3d0d56087d41..9e769ef50f2e 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c >>>>>> @@ -45,11 +45,8 @@ >>>>>> */ >>>>>> static inline u64 amdgpu_seq64_get_va_base(struct amdgpu_device >>>>>> *adev) >>>>>> { >>>>>> - u64 addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT; >>>>>> - >>>>>> - addr -= AMDGPU_VA_RESERVED_TOP; >>>>>> - >>>>>> - return addr; >>>>>> + return AMDGPU_VA_RESERVED_SEQ64_START( >>>>>> + adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT); >>>>>> } >>>>>> /** >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>> index 666698a57192..f23b6153d310 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>> @@ -135,11 +135,19 @@ struct amdgpu_mem_stats; >>>>>> #define AMDGPU_IS_MMHUB0(x) ((x) >= AMDGPU_MMHUB0_START && (x) < >>>>>> AMDGPU_MMHUB1_START) >>>>>> #define AMDGPU_IS_MMHUB1(x) ((x) >= AMDGPU_MMHUB1_START && (x) < >>>>>> AMDGPU_MAX_VMHUBS) >>>>>> -/* Reserve 2MB at top/bottom of address space for kernel use */ >>>>>> +/* Reserve space at top/bottom of address space for kernel use */ >>>>>> #define AMDGPU_VA_RESERVED_CSA_SIZE (2ULL << 20) >>>>>> +#define AMDGPU_VA_RESERVED_CSA_START(top) ((top) \ >>>>>> + - AMDGPU_VA_RESERVED_CSA_SIZE) >>>>>> #define AMDGPU_VA_RESERVED_SEQ64_SIZE (2ULL << 20) >>>>>> +#define AMDGPU_VA_RESERVED_SEQ64_START(top) >>>>>> (AMDGPU_VA_RESERVED_CSA_START(top) \ >>>>>> + - AMDGPU_VA_RESERVED_SEQ64_SIZE) >>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 12) >>>>>> +#define AMDGPU_VA_RESERVED_TRAP_START(top) >>>>>> (AMDGPU_VA_RESERVED_SEQ64_START(top) \ >>>>>> + - AMDGPU_VA_RESERVED_TRAP_SIZE) >>>>>> #define AMDGPU_VA_RESERVED_BOTTOM (2ULL << 20) >>>>>> -#define AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_SEQ64_SIZE + \ >>>>>> +#define AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_TRAP_SIZE + \ >>>>>> + AMDGPU_VA_RESERVED_SEQ64_SIZE + \ >>>>>> AMDGPU_VA_RESERVED_CSA_SIZE) >>>>>> /* See vm_update_mode */ >>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c >>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c >>>>>> index 6604a3f99c5e..f899cce25b2a 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c >>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c >>>>>> @@ -36,6 +36,7 @@ >>>>>> #include <linux/mm.h> >>>>>> #include <linux/mman.h> >>>>>> #include <linux/processor.h> >>>>>> +#include "amdgpu_vm.h" >>>>>> /* >>>>>> * The primary memory I/O features being added for revisions of >>>>>> gfxip >>>>>> @@ -326,10 +327,16 @@ static void kfd_init_apertures_vi(struct >>>>>> kfd_process_device *pdd, uint8_t id) >>>>>> * with small reserved space for kernel. >>>>>> * Set them to CANONICAL addresses. >>>>>> */ >>>>>> - pdd->gpuvm_base = SVM_USER_BASE; >>>>>> + pdd->gpuvm_base = max(SVM_USER_BASE, AMDGPU_VA_RESERVED_BOTTOM); >>>>>> pdd->gpuvm_limit = >>>>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1; >>>>>> + /* dGPUs: the reserved space for kernel >>>>>> + * before SVM >>>>>> + */ >>>>>> + pdd->qpd.cwsr_base = SVM_CWSR_BASE; >>>>>> + pdd->qpd.ib_base = SVM_IB_BASE; >>>>>> + >>>>>> pdd->scratch_base = MAKE_SCRATCH_APP_BASE_VI(); >>>>>> pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base); >>>>>> } >>>>>> @@ -339,18 +346,19 @@ static void kfd_init_apertures_v9(struct >>>>>> kfd_process_device *pdd, uint8_t id) >>>>>> pdd->lds_base = MAKE_LDS_APP_BASE_V9(); >>>>>> pdd->lds_limit = MAKE_LDS_APP_LIMIT(pdd->lds_base); >>>>>> - /* Raven needs SVM to support graphic handle, etc. >>>>>> Leave the small >>>>>> - * reserved space before SVM on Raven as well, even >>>>>> though we don't >>>>>> - * have to. >>>>>> - * Set gpuvm_base and gpuvm_limit to CANONICAL addresses >>>>>> so that they >>>>>> - * are used in Thunk to reserve SVM. >>>>>> - */ >>>>>> - pdd->gpuvm_base = SVM_USER_BASE; >>>>>> + pdd->gpuvm_base = AMDGPU_VA_RESERVED_BOTTOM; >>>>>> pdd->gpuvm_limit = >>>>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1; >>>>>> pdd->scratch_base = MAKE_SCRATCH_APP_BASE_V9(); >>>>>> pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base); >>>>>> + >>>>>> + /* >>>>>> + * Place TBA/TMA on opposite side of VM hole to prevent >>>>>> + * stray faults from triggering SVM on these pages. >>>>>> + */ >>>>>> + pdd->qpd.cwsr_base = AMDGPU_VA_RESERVED_TRAP_START( >>>>>> + pdd->dev->adev->vm_manager.max_pfn << >>>>>> AMDGPU_GPU_PAGE_SHIFT); >>>>>> } >>>>>> int kfd_init_apertures(struct kfd_process *process) >>>>>> @@ -407,12 +415,6 @@ int kfd_init_apertures(struct kfd_process >>>>>> *process) >>>>>> return -EINVAL; >>>>>> } >>>>>> } >>>>>> - >>>>>> - /* dGPUs: the reserved space for kernel >>>>>> - * before SVM >>>>>> - */ >>>>>> - pdd->qpd.cwsr_base = SVM_CWSR_BASE; >>>>>> - pdd->qpd.ib_base = SVM_IB_BASE; >>>>>> } >>>>>> dev_dbg(kfd_device, "node id %u\n", id); >>>>>