GART write flush error on SI w/ amdgpu

nhaehnle@xxxxxxxxx (Nicolai Hähnle) · Tue, 9 May 2017 14:13:40 +0200

Hi all,

I'm seeing some very strange errors on Verde with CPU readback from 
GART, and am pretty much out of ideas. Some help would be very much 
appreciated.

The error manifests with the 
GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu, 
but *not* on radeon. Here's what the test does:

1. Upload a texture.
2. Read the texture back via a shader that uses shader buffer writes to 
write data to a buffer that is allocated in GART.
3. The CPU then reads from the buffer -- and sometimes gets stale data.

This sequence is repeated for many sub-tests. There are some sub-tests 
where the CPU reads stale data from the buffer, i.e. the shader writes 
simply don't make it to the CPU. The tests vary superficially, e.g. the 
first failing test is (almost?) always one where data is written in 
16-bit words (but there are succeeding sub-tests with 16-bit writes as 
well).

The bug is *not* a timing issue. Adding even a 1-second delay (sleep(1);)
between the fence wait and the return of glMapBuffer does not fix the 
problem. The data must be stuck in a cache somewhere.

Since the test runs okay with the radeon module, I tried some changes 
based on comparing the IB submit between radeon and amdgpu, and based on 
comparing register settings via scans obtained from umr. Some of the 
things I've tried:

- Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and 
amdgpu/gfx9 set this)
- Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the 
vmid (radeon does this)
- Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of 
PFP (which seems more logical, and is done by gfx7+), or remove the 
corresponding WRITE_DATA entirely

None of these changes helped.

What *does* help is adding an artificial wait. Specifically, I'm adding 
a sequence of

- WRITE_DATA
- CACHE_FLUSH_AND_INV_TS_EVENT (BOTTOM_OF_PIPE_TS has same behavior)
- WAIT_REG_MEM

as can be seen in the attached patch. This works around the problem, but 
it makes no sense:

Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence 
works around the problem. However(!) it does not actually cause the UMD 
to wait any longer than before. Without this change, the UMD immediately 
sees a signaled user fence (and never uses an ioctl to wait), and with 
this change, it *still* sees a signaled user fence.

Also, note that the way I've hacked the change, the wait sequence is 
only added for the user fence emit (and I'm using a modified UMD to 
ensure that there is enough memory to be used by the added wait sequence).

Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around 
the problem.
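
For readers without the attachment, the wait sequence could be sketched
roughly as below. This is only an illustration of the three packets named
above, not the attached patch: the helper names (struct ib, ib_write,
emit_wait_sequence), the scratch address, and the poll interval are
hypothetical, and the opcode and bit-field values are written from memory
of the SI PM4 definitions in sid.h, so they should be checked against the
headers before use. It assumes the idea is to clear a scratch dword, let
an end-of-pipe event write a sequence number to it, and stall the CP until
that number is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Type-3 PM4 header: opcode in bits 15:8, (body dwords - 1) in bits 29:16. */
#define PACKET3(op, n) \
    ((3u << 30) | (((uint32_t)(op) & 0xFFu) << 8) | (((uint32_t)(n) & 0x3FFFu) << 16))

/* Values recalled from sid.h (assumptions; verify against the kernel headers). */
#define PKT3_WRITE_DATA               0x37
#define PKT3_WAIT_REG_MEM             0x3C
#define PKT3_EVENT_WRITE_EOP          0x47
#define EVENT_CACHE_FLUSH_AND_INV_TS  0x14  /* BOTTOM_OF_PIPE_TS would be 0x28 */

/* Hypothetical stand-in for a ring/IB: a small dword buffer. */
struct ib {
    uint32_t dw[32];
    size_t n;
};

static void ib_write(struct ib *ib, uint32_t v)
{
    assert(ib->n < sizeof(ib->dw) / sizeof(ib->dw[0]));
    ib->dw[ib->n++] = v;
}

/* Emits WRITE_DATA + EVENT_WRITE_EOP + WAIT_REG_MEM; returns dwords written. */
static size_t emit_wait_sequence(struct ib *ib, uint64_t scratch_addr, uint32_t seq)
{
    size_t start = ib->n;
    uint32_t lo = (uint32_t)(scratch_addr & 0xFFFFFFFCu); /* dword aligned */
    uint32_t hi = (uint32_t)(scratch_addr >> 32) & 0xFFFFu;

    /* 1. WRITE_DATA: clear the scratch dword from the ME engine. */
    ib_write(ib, PACKET3(PKT3_WRITE_DATA, 3));
    ib_write(ib, (1u << 8) | (1u << 20)); /* DST_SEL = memory, WR_CONFIRM */
    ib_write(ib, lo);
    ib_write(ib, hi);
    ib_write(ib, 0);

    /* 2. End-of-pipe event: flush and invalidate caches, then write seq. */
    ib_write(ib, PACKET3(PKT3_EVENT_WRITE_EOP, 4));
    ib_write(ib, EVENT_CACHE_FLUSH_AND_INV_TS | (5u << 8)); /* EVENT_INDEX = 5 */
    ib_write(ib, lo);
    ib_write(ib, hi | (1u << 29)); /* DATA_SEL = 1 (32-bit data), no interrupt */
    ib_write(ib, seq);
    ib_write(ib, 0);

    /* 3. WAIT_REG_MEM: stall the CP until the scratch dword equals seq. */
    ib_write(ib, PACKET3(PKT3_WAIT_REG_MEM, 5));
    ib_write(ib, 3u | (1u << 4)); /* FUNCTION = equal, MEM_SPACE = memory */
    ib_write(ib, lo);
    ib_write(ib, hi);
    ib_write(ib, seq);         /* reference value */
    ib_write(ib, 0xFFFFFFFFu); /* compare mask */
    ib_write(ib, 4);           /* poll interval (arbitrary placeholder) */

    return ib->n - start;
}
```

With this layout the sequence occupies 18 dwords, which is the extra
space the fence emit has to reserve for it.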

So for whatever reason, the added wait sequence *before* the 
SURFACE_SYNC encourages some part of the GPU to flush the data from 
wherever it's stuck, and that's just really bizarre. There must be 
something really simple I'm missing, and any pointers would be appreciated.

Thanks,
Nicolai
-- 
Learn how the world really is,
but never forget how it should be.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfx6_hacks.diff
Type: text/x-patch
Size: 6321 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170509/07d676d5/attachment.bin>