That's exactly my concern as well.
This looks a bit like the test creates erroneous data somehow, but there doesn't seem to be a RAS check in the MM data path.
And now that we use the BAR path it goes up in flames.
I just don't see how we can create erroneous data in a test case?
Christian.
On 14.04.2020 at 16:35, "Deucher, Alexander" <Alexander.Deucher@xxxxxxx> wrote:
[AMD Public Use]
If this causes an issue, any access to vram via the BAR could cause an issue.
Alex
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Russell, Kent <Kent.Russell@xxxxxxx>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]
On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> On 13.04.20 at 20:20, Kent Russell wrote:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
> > Reviewed-by: Christian König <christian.koenig@xxxxxxx>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >  1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -    last = min(pos + size, adev->gmc.visible_vram_size);
> > -    if (last > pos) {
> > -        void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -        size_t count = last - pos;
> > -
> > -        if (write) {
> > -            memcpy_toio(addr, buf, count);
> > -            mb();
> > -            amdgpu_asic_flush_hdp(adev, NULL);
> > -        } else {
> > -            amdgpu_asic_invalidate_hdp(adev, NULL);
> > -            mb();
> > -            memcpy_fromio(buf, addr, count);
> > -        }
> > -
> > -        if (count == size)
> > -            return;
> > -
> > -        pos += count;
> > -        buf += count / 4;
> > -        size -= count;
> > -    }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >          uint32_t tmp = pos >> 31;
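For readers following the patch, the range arithmetic in the removed hunk can be tried in user space. The sketch below is only an illustration: `struct vram_split` and `split_vram_access()` are hypothetical names that do not exist in amdgpu, and nothing here touches hardware. It reproduces the `min(pos + size, visible_vram_size)` clamp to show how many bytes the removed fast path would have copied through the BAR and how many would be left for the register-based MM path that follows it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of amdgpu: describes how an access of
 * `size` bytes at VRAM offset `pos` divides between the CPU-visible BAR
 * window and the register-based MM fallback path. */
struct vram_split {
    uint64_t bar_bytes;  /* bytes that fall inside the visible window */
    uint64_t mmio_bytes; /* bytes left for the register-based loop */
};

static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_vram_size)
{
    struct vram_split s = { 0, size };
    uint64_t last;

    assert(pos + size >= pos); /* reject ranges that wrap around */

    /* Same clamp as the removed hunk: last = min(pos + size, visible). */
    last = pos + size < visible_vram_size ? pos + size : visible_vram_size;

    /* Only the part of the range below the visible limit uses the BAR. */
    if (last > pos) {
        s.bar_bytes = last - pos;
        s.mmio_bytes = size - s.bar_bytes;
    }
    return s;
}
```

With a 256-byte visible window, `split_vram_access(248, 16, 256)` yields 8 bytes for each path, and an access starting at offset 512 yields no BAR bytes at all. A range that lies entirely in invisible VRAM therefore skips the BAR branch in this arithmetic.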
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx