On Tue, Aug 06, 2013 at 09:03:01PM +0300, Ville Syrjälä wrote: > On Tue, Aug 06, 2013 at 05:16:14PM +0100, Chris Wilson wrote: > > On Tue, Aug 06, 2013 at 05:24:25PM +0300, Ville Syrjälä wrote: > > > On Tue, Aug 06, 2013 at 01:17:04PM +0100, Chris Wilson wrote: > > > > +static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj) > > > > +{ > > > > + if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level)) > > > > + return true; > > > > + > > > > + return atomic_read(&obj->fb_count) > 0; > > > > +} > > > > > > I was thinking a bit about this fb_count thing. The other option would > > > be to track actual scanout activity, which wouldn't be hard to do, but > > > I guess that could shift a bunch of the clflush cost to page flip time. > > > So maybe we don't want that? > > > > Different series, see the frontbuffer write tracking. As for page flip, > > that is currently the only time we clflush... > > > > > Also i915_gem_sw_finish_ioctl() still looks at pin_count as an > > > indication whether scanout is happening. We should add an fb_count > > > check there as well. Maybe we should leave the pin_count check there, > > > so that framebuffers that aren't being scanned out currently can be > > > massaged a bit more freely before a clflush needs to be issued? > > > > It does do a fb_count check, the pin_count + fb_count is a really strong > > indicator of active scanout. > > The fb_count checks is done only through the eventual check in > i915_gem_clflush_object() which is why I didn't think of it. > > On non-LLC platforms it also ends up doing the flush for all > non-snooped non-scanout buffers, which would get optimized away if the > fb_count check was where the pin_count check is. But it does still do the fb_count check before the flush? What am I missing here? > > > > - obj->tiling_mode == I915_TILING_NONE && > > > > - obj->base.write_domain != I915_GEM_DOMAIN_CPU) { > > > > + if (obj->tiling_mode == I915_TILING_NONE && > > > > + !cpu_write_needs_clflush(obj)) { > > > > > > This would break non-LLC platforms, I think. Before, GTT writes were > > > used for non-snooped buffers w/ clean CPU caches (write domain != cpu), > > > which makes sense since access via GTT doesn't snoop ever according > > > to the docs. Now I think it'll do GTT writes for snooped non-scanout > > > buffers. > > > > D'oh. What I was aiming for was to always use shmem writes on snooped > > bo, irrespective of whether it is GPU hot. > > > > s/!cpu_write_needs_clflush/cpu_write_needs_clflush/ > > So that means a GTT write for anything non-snooped or scanout. If it's > snooped scanout (I guess we may not really care about such a weird > combination), or non-snooped but in CPU write domain, that would > still have issues with possibly dirty data in the CPU caches. Hmm, that should be fixed up by i915_gem_gtt_pwrite_fast() as we flush all the caches before we start overwriting. I've thoroughly confused myself over which paths do the fancy tricks. :( > Might be intersting to try on real hardware what the GTT access > snooping behaviour really is, especially for some modern non-LLC > system like VLV. I remember why I have the !DOMAIN_CPU test now. If we have already paid the price to pull it back into the CPU domain, let it rest there until it is used again. So as it stands, I think we want: if (obj->tiling_mode == I915_TILING_NONE && obj->base.write_domain != I915_GEM_DOMAIN_CPU && cpu_write_needs_clflush(obj)) i915_gem_gtt_pwrite_fast(); > > > > > > > > > > ret = i915_gem_gtt_pwrite_fast(dev, obj, args, file); > > > > /* Note that the gtt paths might fail with non-page-backed user > > > > * pointers (e.g. gtt mappings when moving data between > > > > @@ -3299,7 +3305,7 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj) > > > > * snooping behaviour occurs naturally as the result of our domain > > > > * tracking. > > > > */ > > > > - if (obj->cache_level != I915_CACHE_NONE) > > > > + if (!cpu_write_needs_clflush(obj)) > > > > return; > > > > > > I would actually prefer to kill this check completely, and move the > > > decision whether to do a clflush to the callers. > > > > I saw, I differ in opinion. > > Opinion noted. > > > > > > There are three places that I spotted which lack such a check; > > > i915_gem_execbuffer_move_to_gpu() and i915_gem_object_set_to_cpu_domain() > > > need to flush only when !cpu_cache_is_coherent(), whereas > > > i915_gem_object_flush_cpu_write_domain() needs to flush for scanout as > > > well. > > > > I like the approach of trying to keep as much of the complexity inside > > i915_gem.c and exporting simple rules to the callers. > > OK. Looks like you already took care of i915_gem_object_set_to_cpu_domain() > in patch 2/4, and i915_gem_object_flush_cpu_write_domain() works OK with > the check in i915_gem_clflush_object(). > > So i915_gem_execbuffer_move_to_gpu() is the only one that may end up > doing some needless flushes with your code. Gotcha. In execbuffer, we will do a clflush on any object with fb_count and write_domain=CPU. I'm not sure that is a flush worth eliminating, but that might change with GFDT. But for GFDT, we can just eliminate the clflush altogether and rely on userspace notifying the kernel it needs to flush via dirty_fb. However, even with GFDT I don't think it will be a good idea for frontbuffer writes as the cache eviction is still likely to cause blotches as the screen updates at different times. Hmm, on the other hand, a CPU upload into the backbuffer is possible, as userspace may be blissfully unaware that the buffer is actually an intel_fb. Rendering after that would indeed cause needless clflushes. I'll take another look. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx