On 20.06.2017 12:34, Marek Olšák wrote: > BTW, I noticed the flush sequence in the kernel is wrong. The correct > flush sequence should be: > > 1) EVENT_WRITE_EOP - CACHE_FLUSH_AND_INV_TS - write a dword to memory, > but no fence/interrupt. > 2) WAIT_REG_MEM on the dword to wait for idle before SURFACE_SYNC. > 3) SURFACE_SYNC (TC, K$, I$) > 4) Write CP_COHER_CNTL2. > 5) EVENT_WRITE_EOP - BOTTOM_OF_PIPE_TS - write the fence with the interrupt. > > WAIT_REG_MEM wouldn't be needed if we were able to merge > CACHE_FLUSH_AND_INV, SURFACE_SYNC, and CP_COHER_CNTL2 into one EOP > event. > > The main issue with the current flush sequence in radeon and amdgpu is > that it doesn't wait for idle before writing CP_COHER_CNTL2 and > SURFACE_SYNC. So far we've been able to avoid the bug by waiting for > idle in userspace IBs. This is gfx9-only though, right? Cheers, Nicolai > > Marek > > > On Fri, May 26, 2017 at 5:47 PM, Marek Olšák <maraeo at gmail.com> wrote: >> On Tue, May 9, 2017 at 2:13 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote: >>> Hi all, >>> >>> I'm seeing some very strange errors on Verde with CPU readback from GART, >>> and am pretty much out of ideas. Some help would be very much appreciated. >>> >>> The error manifests with the >>> GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu, >>> but >>> *not* on radeon. Here's what the test does: >>> >>> 1. Upload a texture. >>> 2. Read the texture back via a shader that uses shader buffer writes to >>> write data to a buffer that is allocated in GART. >>> 3. The CPU then reads from the buffer -- and sometimes gets stale data. >>> >>> This sequence is repeated for many sub-tests. There are some sub-tests >>> where >>> the CPU reads stale data from the buffer, i.e. the shader writes simply >>> don't make it to the CPU. The tests vary superficially, e.g. the first >>> failing test is (almost?) always one where data is written in 16-bit words >>> (but there are succeeding sub-tests with 16-bit writes as well). >>> >>> The bug is *not* a timing issue. Adding even a 1sec delay (sleep(1);) >>> between the fence wait and the return of glMapBuffer does not fix the >>> problem. The data must be stuck in a cache somewhere. >>> >>> Since the test runs okay with the radeon module, I tried some changes >>> based >>> on comparing the IB submit between radeon and amdgpu, and based on >>> comparing >>> register settings via scans obtained from umr. Some of the things I've >>> tried: >>> >>> - Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and >>> amdgpu/gfx9 >>> set this) >>> - Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the vmid >>> (radeon does this) >>> - Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of >>> PFP >>> (which seems more logical, and is done by gfx7+), or remove the >>> corresponding WRITE_DATA entirely >>> >>> None of these changes helped. >>> >>> What *does* help is adding an artificial wait. Specifically, I'm adding a >>> sequence of >>> >>> - WRITE_DATA >>> - CACHE_FLUSH_AND_INV_TS_EVENT (BOTTOM_OF_PIPE_TS has same behavior) >>> - WAIT_REG_MEM >>> >>> as can be seen in the attached patch. This works around the problem, but >>> it >>> makes no sense: >>> >>> Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence >>> works >>> around the problem. However(!) it does not actually cause the UMD to wait >>> any longer than before. Without this change, the UMD immediately sees a >>> signaled user fence (and never uses an ioctl to wait), and with this >>> change, >>> it *still* sees a signaled user fence. >>> >>> Also, note that the way I've hacked the change, the wait sequence is only >>> added for the user fence emit (and I'm using a modified UMD to ensure that >>> there is enough memory to be used by the added wait sequence). >>> >>> Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around >>> the >>> problem. >>> >>> So for whatever reason, the added wait sequence *before* the SURFACE_SYNC >>> encourages some part of the GPU to flush the data from wherever it's >>> stuck, >>> and that's just really bizarre. There must be something really simple I'm >>> missing, and any pointers would be appreciated. >> >> Have you tried this? >> >> diff --git a/src/gallium/drivers/radeonsi/si_hw_context.c >> b/src/gallium/drivers/radeonsi/si_hw_context.c >> index 92c09cb..e6ac0ba 100644 >> --- a/src/gallium/drivers/radeonsi/si_hw_context.c >> +++ b/src/gallium/drivers/radeonsi/si_hw_context.c >> @@ -133,7 +133,8 @@ void si_context_gfx_flush(void *context, unsigned flags, >> SI_CONTEXT_PS_PARTIAL_FLUSH; >> >> /* DRM 3.1.0 doesn't flush TC for VI correctly. */ >> - if (ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1) >> + if ((ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1) >> || >> + (ctx->b.chip_class == SI && ctx->b.screen->info.drm_major == 3)) >> ctx->b.flags |= SI_CONTEXT_INV_GLOBAL_L2 | >> SI_CONTEXT_INV_VMEM_L1; >> >> One more cache flush there shouldn't hurt. >> >> Also, Mesa uses PFP_SYNC_ME. It shouldn't be necessary, but it's worth a >> try. >> >> Marek -- Lerne, wie die Welt wirklich ist, Aber vergiss niemals, wie sie sein sollte.