[Public]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> Shengyu Qu
> Sent: Saturday, February 3, 2024 8:05 AM
> To: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: wiagn233@xxxxxxxxxxx; Cornwall, Jay <Jay.Cornwall@xxxxxxx>;
> Koenig, Christian <Christian.Koenig@xxxxxxx>; Paneer Selvam, Arunpravin
> <Arunpravin.PaneerSelvam@xxxxxxx>
> Subject: Re: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)
>
> Hi Felix,
>
> Sorry for my late reply. I was busy this week.
>
> I just did some more tests using the next-20240202 branch. Testing with
> blender 4.0.2, when only one HIP render task is running, there's no problem.
> However, when two tasks run together, the software always crashes, but it
> does not crash the whole system. Dmesg reports a GPU reset in most cases,
> for example:
>
> [  176.071823] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx_0.0.0 timeout, signaled seq=32608, emitted seq=32610
> [  176.072000] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> information: process blender pid 4256 thread blender:cs0 pid 4297
> [  176.072143] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
> [  176.073571] amdgpu 0000:03:00.0: amdgpu: Guilty job already signaled,
> skipping HW reset
> [  176.073593] amdgpu 0000:03:00.0: amdgpu: GPU reset(4) succeeded!
>
> And in some rare cases, there is a page fault report; see dmesg.log.
> Do you have any idea? Can I make it print more detailed diagnostic
> information?

Are you only seeing the problem with this patch applied, or in general? If you are seeing it in general, it is likely related to a firmware issue that was recently fixed and will be resolved with an updated CP firmware image.
Driver side changes:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/0eb6c664b780dd1b4080e047ad51b100cd7840a3
https://gitlab.freedesktop.org/agd5f/linux/-/commit/40970e60070ed3d1390ec65e38e819f6d81b8f0c

Alex

> Best regards,
> Shengyu
>
> On 2024/1/30 01:47, Felix Kuehling wrote:
> > On 2024-01-29 10:24, Shengyu Qu wrote:
> >> Hello Felix,
> >>
> >> I think you are right. This problem has existed for years (just look
> >> at the issue creation time in my link), and is thought to be caused by
> >> OpenGL-ROCm interop (that's why I think this patch might help). It is
> >> very easy to trigger this problem in blender (the method is also
> >> mentioned in the link).
> >
> > This doesn't help you, but it's unlikely that this has been the same
> > issue for two years for everybody who chimed into this bug report.
> > Different kernel versions, GPUs, user mode ROCm and Mesa versions, etc.
> >
> > Case in point, it's possible that you're seeing an issue specific to
> > RDNA3, which hasn't even been around for that long.
> >
> >> Do you have any idea about this?
> >
> > Not without seeing a lot more diagnostic information. A full backtrace
> > from your kernel log would be a good start.
> >
> > Regards,
> >   Felix
> >
> >> Best regards,
> >> Shengyu
> >>
> >> On 2024/1/29 22:51, Felix Kuehling wrote:
> >>> On 2024-01-29 8:58, Shengyu Qu wrote:
> >>>> Hi,
> >>>>
> >>>> It seems the ROCm-OpenGL interop hang problem still exists [1].
> >>>> By the way, have you looked into this problem?
> >>>>
> >>>> Best regards,
> >>>> Shengyu
> >>>>
> >>>> [1]
> >>>> https://projects.blender.org/blender/blender/issues/100353#issuecomment-1111599
> >>>
> >>> Maybe you're having a different problem. Do you see this issue also
> >>> without any version of the "Relocate TBA/TMA ..." patch?
> >>>
> >>> Regards,
> >>>   Felix
> >>>
> >>>> On 2024/1/27 03:15, Shengyu Qu wrote:
> >>>>> Hello Felix,
> >>>>>
> >>>>> This patch seems to work on my system, and it also seems to fix
> >>>>> the ROCm/OpenGL interop problem.
> >>>>> Is this intended to happen or not? Maybe we need more users to
> >>>>> test it.
> >>>>>
> >>>>> Besides,
> >>>>> Tested-by: Shengyu Qu <wiagn233@xxxxxxxxxxx>
> >>>>>
> >>>>> Best Regards,
> >>>>> Shengyu
> >>>>>
> >>>>> On 2024/1/26 06:27, Felix Kuehling wrote:
> >>>>>> The TBA and TMA, along with an unused IB allocation, reside at
> >>>>>> low addresses in the VM address space. A stray VM fault which
> >>>>>> hits these pages must be serviced by making their page table
> >>>>>> entries invalid. The scheduler depends upon these pages being
> >>>>>> resident and fails, preventing a debugger from inspecting the
> >>>>>> failure state.
> >>>>>>
> >>>>>> By relocating these pages above 47 bits in the VM address space
> >>>>>> they can only be reached when bits [63:48] are set to 1. This
> >>>>>> makes it much less likely for a misbehaving program to generate
> >>>>>> accesses to them. The current placement at VA (PAGE_SIZE*2) is
> >>>>>> readily hit by a NULL access with a small offset.
> >>>>>>
> >>>>>> v2:
> >>>>>> - Move it to the reserved space to avoid conflicts with Mesa
> >>>>>> - Add macros to make reserved space management easier
> >>>>>>
> >>>>>> Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@xxxxxxx>
> >>>>>> Cc: Christian Koenig <christian.koenig@xxxxxxx>
> >>>>>> Signed-off-by: Jay Cornwall <jay.cornwall@xxxxxxx>
> >>>>>> Signed-off-by: Felix Kuehling <felix.kuehling@xxxxxxx>
> >>>>>> ---
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c      |  4 +--
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c    |  7 ++---
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h       | 12 ++++++--
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 30 +++++++++++---------
> >>>>>>  4 files changed, 30 insertions(+), 23 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>>>>> index 823d31f4a2a3..53d0a458d78e 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>>>>> @@ -28,9 +28,9 @@
> >>>>>>
> >>>>>>  uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
> >>>>>>  {
> >>>>>> -	uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
> >>>>>> +	uint64_t addr = AMDGPU_VA_RESERVED_CSA_START(
> >>>>>> +		adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
> >>>>>>
> >>>>>> -	addr -= AMDGPU_VA_RESERVED_CSA_SIZE;
> >>>>>>  	addr = amdgpu_gmc_sign_extend(addr);
> >>>>>>
> >>>>>>  	return addr;
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
> >>>>>> index 3d0d56087d41..9e769ef50f2e 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
> >>>>>> @@ -45,11 +45,8 @@
> >>>>>>   */
> >>>>>>  static inline u64 amdgpu_seq64_get_va_base(struct amdgpu_device *adev)
> >>>>>>  {
> >>>>>> -	u64 addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
> >>>>>> -
> >>>>>> -	addr -= AMDGPU_VA_RESERVED_TOP;
> >>>>>> -
> >>>>>> -	return addr;
> >>>>>> +	return AMDGPU_VA_RESERVED_SEQ64_START(
> >>>>>> +		adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
> >>>>>>  }
> >>>>>>
> >>>>>>  /**
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> >>>>>> index 666698a57192..f23b6153d310 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> >>>>>> @@ -135,11 +135,19 @@ struct amdgpu_mem_stats;
> >>>>>>  #define AMDGPU_IS_MMHUB0(x)	((x) >= AMDGPU_MMHUB0_START && (x) < AMDGPU_MMHUB1_START)
> >>>>>>  #define AMDGPU_IS_MMHUB1(x)	((x) >= AMDGPU_MMHUB1_START && (x) < AMDGPU_MAX_VMHUBS)
> >>>>>>
> >>>>>> -/* Reserve 2MB at top/bottom of address space for kernel use */
> >>>>>> +/* Reserve space at top/bottom of address space for kernel use */
> >>>>>>  #define AMDGPU_VA_RESERVED_CSA_SIZE	(2ULL << 20)
> >>>>>> +#define AMDGPU_VA_RESERVED_CSA_START(top)	((top) \
> >>>>>> +						 - AMDGPU_VA_RESERVED_CSA_SIZE)
> >>>>>>  #define AMDGPU_VA_RESERVED_SEQ64_SIZE	(2ULL << 20)
> >>>>>> +#define AMDGPU_VA_RESERVED_SEQ64_START(top)	(AMDGPU_VA_RESERVED_CSA_START(top) \
> >>>>>> +						 - AMDGPU_VA_RESERVED_SEQ64_SIZE)
> >>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE	(2ULL << 12)
> >>>>>> +#define AMDGPU_VA_RESERVED_TRAP_START(top)	(AMDGPU_VA_RESERVED_SEQ64_START(top) \
> >>>>>> +						 - AMDGPU_VA_RESERVED_TRAP_SIZE)
> >>>>>>  #define AMDGPU_VA_RESERVED_BOTTOM	(2ULL << 20)
> >>>>>> -#define AMDGPU_VA_RESERVED_TOP		(AMDGPU_VA_RESERVED_SEQ64_SIZE + \
> >>>>>> +#define AMDGPU_VA_RESERVED_TOP		(AMDGPU_VA_RESERVED_TRAP_SIZE + \
> >>>>>> +						 AMDGPU_VA_RESERVED_SEQ64_SIZE + \
> >>>>>>  						 AMDGPU_VA_RESERVED_CSA_SIZE)
> >>>>>>
> >>>>>>  /* See vm_update_mode */
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
> >>>>>> index 6604a3f99c5e..f899cce25b2a 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
> >>>>>> @@ -36,6 +36,7 @@
> >>>>>>  #include <linux/mm.h>
> >>>>>>  #include <linux/mman.h>
> >>>>>>  #include <linux/processor.h>
> >>>>>> +#include "amdgpu_vm.h"
> >>>>>>
> >>>>>>  /*
> >>>>>>   * The primary memory I/O features being added for revisions of gfxip
> >>>>>> @@ -326,10 +327,16 @@ static void kfd_init_apertures_vi(struct kfd_process_device *pdd, uint8_t id)
> >>>>>>  	 * with small reserved space for kernel.
> >>>>>>  	 * Set them to CANONICAL addresses.
> >>>>>>  	 */
> >>>>>> -	pdd->gpuvm_base = SVM_USER_BASE;
> >>>>>> +	pdd->gpuvm_base = max(SVM_USER_BASE, AMDGPU_VA_RESERVED_BOTTOM);
> >>>>>>  	pdd->gpuvm_limit =
> >>>>>>  		pdd->dev->kfd->shared_resources.gpuvm_size - 1;
> >>>>>>
> >>>>>> +	/* dGPUs: the reserved space for kernel
> >>>>>> +	 * before SVM
> >>>>>> +	 */
> >>>>>> +	pdd->qpd.cwsr_base = SVM_CWSR_BASE;
> >>>>>> +	pdd->qpd.ib_base = SVM_IB_BASE;
> >>>>>> +
> >>>>>>  	pdd->scratch_base = MAKE_SCRATCH_APP_BASE_VI();
> >>>>>>  	pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
> >>>>>>  }
> >>>>>> @@ -339,18 +346,19 @@ static void kfd_init_apertures_v9(struct kfd_process_device *pdd, uint8_t id)
> >>>>>>  	pdd->lds_base = MAKE_LDS_APP_BASE_V9();
> >>>>>>  	pdd->lds_limit = MAKE_LDS_APP_LIMIT(pdd->lds_base);
> >>>>>>
> >>>>>> -	/* Raven needs SVM to support graphic handle, etc. Leave the small
> >>>>>> -	 * reserved space before SVM on Raven as well, even though we don't
> >>>>>> -	 * have to.
> >>>>>> -	 * Set gpuvm_base and gpuvm_limit to CANONICAL addresses so that they
> >>>>>> -	 * are used in Thunk to reserve SVM.
> >>>>>> -	 */
> >>>>>> -	pdd->gpuvm_base = SVM_USER_BASE;
> >>>>>> +	pdd->gpuvm_base = AMDGPU_VA_RESERVED_BOTTOM;
> >>>>>>  	pdd->gpuvm_limit =
> >>>>>>  		pdd->dev->kfd->shared_resources.gpuvm_size - 1;
> >>>>>>
> >>>>>>  	pdd->scratch_base = MAKE_SCRATCH_APP_BASE_V9();
> >>>>>>  	pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
> >>>>>> +
> >>>>>> +	/*
> >>>>>> +	 * Place TBA/TMA on opposite side of VM hole to prevent
> >>>>>> +	 * stray faults from triggering SVM on these pages.
> >>>>>> +	 */
> >>>>>> +	pdd->qpd.cwsr_base = AMDGPU_VA_RESERVED_TRAP_START(
> >>>>>> +		pdd->dev->adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
> >>>>>>  }
> >>>>>>
> >>>>>>  int kfd_init_apertures(struct kfd_process *process)
> >>>>>> @@ -407,12 +415,6 @@ int kfd_init_apertures(struct kfd_process *process)
> >>>>>>  				return -EINVAL;
> >>>>>>  			}
> >>>>>>  		}
> >>>>>> -
> >>>>>> -		/* dGPUs: the reserved space for kernel
> >>>>>> -		 * before SVM
> >>>>>> -		 */
> >>>>>> -		pdd->qpd.cwsr_base = SVM_CWSR_BASE;
> >>>>>> -		pdd->qpd.ib_base = SVM_IB_BASE;
> >>>>>>  	}
> >>>>>>
> >>>>>>  	dev_dbg(kfd_device, "node id %u\n", id);