On Thu, Nov 26, 2015 at 09:09:37AM +0530, Goel, Akash wrote:
> 
> 
> On 11/25/2015 10:58 PM, Chris Wilson wrote:
> >On Wed, Nov 25, 2015 at 01:02:20PM +0200, Ville Syrjälä wrote:
> >>On Tue, Nov 24, 2015 at 10:39:38PM +0000, Chris Wilson wrote:
> >>>On Tue, Nov 24, 2015 at 07:14:31PM +0100, Daniel Vetter wrote:
> >>>>On Tue, Nov 24, 2015 at 12:04:06PM +0200, Ville Syrjälä wrote:
> >>>>>On Tue, Nov 24, 2015 at 03:35:24PM +0530, akash.goel@xxxxxxxxx wrote:
> >>>>>>From: Akash Goel <akash.goel@xxxxxxxxx>
> >>>>>>
> >>>>>>When the object is moved out of the CPU read domain, the cachelines
> >>>>>>are not invalidated immediately. The invalidation is deferred until
> >>>>>>the next time the object is brought back into the CPU read domain.
> >>>>>>But the invalidation is done unconditionally, i.e. even for the case
> >>>>>>where the cachelines were already flushed when the object moved out
> >>>>>>of the CPU write domain. That redundant clflush is avoidable, giving
> >>>>>>a small optimization. The case is not hypothetical, but it is
> >>>>>>unlikely to occur often.
> >>>>>>The aim is to detect changes to the backing storage whilst the
> >>>>>>data is potentially in the CPU cache, and only clflush in those cases.
> >>>>>>
> >>>>>>Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>>>>Signed-off-by: Akash Goel <akash.goel@xxxxxxxxx>
> >>>>>>---
> >>>>>> drivers/gpu/drm/i915/i915_drv.h | 1 +
> >>>>>> drivers/gpu/drm/i915/i915_gem.c | 9 ++++++++-
> >>>>>> 2 files changed, 9 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>>diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >>>>>>index df9316f..fedb71d 100644
> >>>>>>--- a/drivers/gpu/drm/i915/i915_drv.h
> >>>>>>+++ b/drivers/gpu/drm/i915/i915_drv.h
> >>>>>>@@ -2098,6 +2098,7 @@ struct drm_i915_gem_object {
> >>>>>> 	unsigned long gt_ro:1;
> >>>>>> 	unsigned int cache_level:3;
> >>>>>> 	unsigned int cache_dirty:1;
> >>>>>>+	unsigned int cache_clean:1;
> >>>>>
> >>>>>So now we have cache_dirty and cache_clean, which seems redundant,
> >>>>>except somehow cache_dirty != !cache_clean?
> >>>
> >>>Exactly, so not entirely redundant. I did think something along MESI lines
> >>>would be useful, but that didn't capture the different meanings we
> >>>employ.
> >>>
> >>>cache_dirty tracks whether we have been eliding the clflush.
> >>>
> >>>cache_clean tracks whether we know the cache has been completely
> >>>clflushed.
> >>
> >>Can we know that with speculative prefetching and whatnot?
> >
> >"The memory attribute of the page containing the affected line has no
> >effect on the behavior of this instruction. It should be noted that
> >processors are free to speculatively fetch and cache data from system
> >memory regions assigned a memory-type allowing for speculative reads
> >(i.e. WB, WC, WT memory types). The Streaming SIMD Extensions PREFETCHh
> >instruction is considered a hint to this speculative behavior.
> >Because
> >this speculative fetching can occur at any time and is not tied to
> >instruction execution, CLFLUSH is not ordered with respect to PREFETCHh
> >or any of the speculative fetching mechanisms (that is, data could be
> >speculatively loaded into the cache just before, during, or after the
> >execution of a CLFLUSH to that cache line)."
> >
> >which taken to the extreme means that we can't get away with this trick.
> >
> >If we can at least guarantee that such speculation can't extend beyond
> >a page boundary, that will be enough to assert that the patch is valid.
> >
> >Hopefully someone knows a CPU guru or two.
> 
> Found some relevant info at https://lwn.net/Articles/255364/
> 
> An excerpt from the same link:
> 
> Hardware Prefetching
> "Prefetching has one big weakness: it cannot cross page boundaries.
> The reason should be obvious when one realizes that the CPUs support
> demand paging. If the prefetcher were allowed to cross page
> boundaries, the access might trigger an OS event to make the page
> available. This by itself can be bad, especially for performance.
> What is worse is that the prefetcher does not know about the
> semantics of the program or the OS itself. It might therefore
> prefetch pages which, in real life, never would be requested. That
> means the prefetcher would run past the end of the memory region the
> processor accessed in a recognizable pattern before. This is not
> only possible, it is very likely. If the processor, as a side effect
> of a prefetch, triggered a request for such a page the OS might even
> be completely thrown off its tracks if such a request could never
> otherwise happen.
> 
> It is therefore important to realize that, regardless of how good
> the prefetcher is at predicting the pattern, the program will
> experience cache misses at page boundaries unless it explicitly
> prefetches or reads from the new page.
This is another reason to > optimize the layout of data as described in Section 6.2 to minimize > cache pollution by keeping unrelated data out. > > Because of this page limitation the processors do not have terribly > sophisticated logic to recognize prefetch patterns. With the still > predominant 4k page size there is only so much which makes sense. > The address range in which strides are recognized has been increased > over the years, but it probably does not make much sense to go > beyond the 512 byte window which is often used today. Currently > prefetch units do not recognize non-linear access pattern" How best to summarise? Add something like * ... * After clflushing we know that this object cannot be in the * CPU cache, nor can it be speculatively loaded into the CPU * cache as our objects are page-aligned (and speculation cannot * cross page boundaries). Whilst this flag is set, we know that * any future access to the object's pages will miss the stale * cache and have to be serviced from main memory, i.e. we do * not need another clflush to invalidate the CPU cache. ? -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx
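[Editor's note: for readers following the flag semantics above, here is a rough, hypothetical C model of the clflush elision being discussed. The helper names are invented for illustration; the real logic lives in the i915 domain-tracking code, and this sketch only captures the relationship between the two bits described in the thread.]

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the relevant bits of drm_i915_gem_object. */
struct gem_object {
	bool cache_dirty; /* we elided a clflush; the CPU cache may hold stale lines */
	bool cache_clean; /* fully clflushed; per the LWN excerpt, speculative
			   * prefetch cannot cross page boundaries, and objects
			   * are page-aligned, so this stays true until the CPU
			   * touches the pages again */
};

/* Stand-in for the actual cacheline flush (e.g. a drm_clflush_* call). */
static void clflush_object(struct gem_object *obj)
{
	printf("clflush\n");
	obj->cache_dirty = false;
	obj->cache_clean = true;
}

/* Moving into the CPU read domain: invalidate only when the cache could
 * still hold stale lines, i.e. skip the clflush if cache_clean is set. */
static void set_to_cpu_read_domain(struct gem_object *obj)
{
	if (!obj->cache_clean)
		clflush_object(obj);
}

/* Any CPU write dirties the cache again and invalidates the clean flag. */
static void cpu_write(struct gem_object *obj)
{
	obj->cache_clean = false;
	obj->cache_dirty = true;
}
```

Note how cache_dirty != !cache_clean: after a fresh allocation both bits are clear — the cache is not known-clean, yet no flush has been elided either.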