Quoting Joonas Lahtinen (2017-10-09 14:36:27)
> Title: s/thresh/thrash/
>
> On Wed, 2017-08-23 at 13:55 +0100, Chris Wilson wrote:
> > At the moment, the verify tests use an extremely brutal write-read of
> > every dword, degrading performance to UC. If we break those up into
> > cachelines, we can do a wcb write/read at a time instead, roughly 8x
> > faster. We lose the accuracy of the forced wcb flushes around every dword,
> > but we are retaining the overall behaviour of checking reads following
> > writes instead. To compensate, we do check that a single dword write/read
> > before using wcb aligned accesses.
> >
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
>
> <SNIP>
>
> > @@ -104,15 +109,78 @@ bo_copy (void *_arg)
> >  	return NULL;
> >  }
> >
> > +#if defined(__x86_64__) && !defined(__clang__)
> > +#define MOVNT 512
> > +
> > +#pragma GCC push_options
> > +#pragma GCC target("sse4.1")
> > +
> > +#include <smmintrin.h>
> > +__attribute__((noinline))
> > +static void copy_wc_page(void *dst, void *src)
> > +{
> > +	if (igt_x86_features() & SSE4_1) {
> > +		__m128i *S = (__m128i *)src;
> > +		__m128i *D = (__m128i *)dst;
> > +
> > +		for (int i = 0; i < PAGE_SIZE/CACHELINE; i++) {
> > +			__m128i tmp[4];
> > +
> > +			tmp[0] = _mm_stream_load_si128(S++);
> > +			tmp[1] = _mm_stream_load_si128(S++);
> > +			tmp[2] = _mm_stream_load_si128(S++);
> > +			tmp[3] = _mm_stream_load_si128(S++);
> > +
> > +			_mm_store_si128(D++, tmp[0]);
> > +			_mm_store_si128(D++, tmp[1]);
> > +			_mm_store_si128(D++, tmp[2]);
> > +			_mm_store_si128(D++, tmp[3]);
> > +		}
> > +	} else
> > +		memcpy(dst, src, PAGE_SIZE);
> > +}
>
> Not lib/ material?

Yes. But you know it's easier to make it work for one case than all.
-Chris
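
For anyone who wants to try the quoted copy_wc_page() idea outside the test, here is a minimal standalone sketch. It is not the patch itself, and several things in it are assumptions: __builtin_cpu_supports() stands in for igt_x86_features(), PAGE_SIZE and CACHELINE are hard-coded to 4096 and 64, a per-function target attribute replaces the GCC pragmas, and copy_wc_page_selftest() is a hypothetical helper added only for checking. The self-test runs on ordinary malloc'ed (write-back) memory, where MOVNTDQA behaves like a normal load, so it only shows that the copy is correct. It says nothing about the speedup on a WC mapping.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096	/* assumed page size */
#define CACHELINE 64	/* assumed cacheline size */

#if defined(__x86_64__)
#include <smmintrin.h>

/* Copy one page, reading a full cacheline with four streaming loads
 * (MOVNTDQA) before storing it. Both pointers must be 16-byte aligned. */
__attribute__((target("sse4.1"), noinline))
static void copy_page_movntdqa(void *dst, const void *src)
{
	__m128i *s = (__m128i *)src;
	__m128i *d = (__m128i *)dst;

	for (int i = 0; i < PAGE_SIZE / CACHELINE; i++) {
		__m128i tmp[4];

		tmp[0] = _mm_stream_load_si128(s++);
		tmp[1] = _mm_stream_load_si128(s++);
		tmp[2] = _mm_stream_load_si128(s++);
		tmp[3] = _mm_stream_load_si128(s++);

		_mm_store_si128(d++, tmp[0]);
		_mm_store_si128(d++, tmp[1]);
		_mm_store_si128(d++, tmp[2]);
		_mm_store_si128(d++, tmp[3]);
	}
}
#endif

static void copy_wc_page(void *dst, const void *src)
{
	/* The aligned load/store intrinsics fault on misaligned pointers. */
	assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);

#if defined(__x86_64__)
	/* Assumption: compiler builtin instead of igt_x86_features(). */
	if (__builtin_cpu_supports("sse4.1")) {
		copy_page_movntdqa(dst, src);
		return;
	}
#endif
	memcpy(dst, src, PAGE_SIZE);
}

/* Hypothetical self-test: returns 0 if the copy matches the source,
 * 1 on mismatch, -1 if allocation failed. */
static int copy_wc_page_selftest(void)
{
	uint32_t *src = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
	uint32_t *dst = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
	int ret = -1;

	if (src && dst) {
		for (size_t i = 0; i < PAGE_SIZE / sizeof(*src); i++)
			src[i] = (uint32_t)i * 0x9e3779b9u;
		memset(dst, 0, PAGE_SIZE);

		copy_wc_page(dst, src);
		ret = memcmp(dst, src, PAGE_SIZE) ? 1 : 0;
	}

	free(src);
	free(dst);
	return ret;
}
```

Calling copy_wc_page_selftest() from a small main() and checking for 0 is enough to confirm the loop copies a whole page. Measuring the benefit would need a real WC mapping as the source.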