On Tue, Jan 12, 2016 at 6:06 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> The double clflush() remains a mystery.
>
> Actually, I think it's explainable.
>
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
>
> Why?
>
> If the GPU really isn't cache coherent, what can happen is:
>
>  - the CPU has the line cached
>
>  - the GPU writes the data
>
>  - you do the clflushopt to invalidate the cacheline
>
>  - you expect to see the GPU data.
>
> Right?
>
> Wrong. The above is complete crap.
>
> Why?
>
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.
>
> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be true. The CPU may
> have written to it quite a long time ago.
>
> So if you are doing a GPU write, and you want to see the data that the
> GPU wrote, you had better do the clflushopt long *before* the GPU ever
> writes to memory.
>
> Your pattern of doing "flush and read" is simply fundamentally buggy.
> There are only two valid CPU flushing patterns:
>
>  - write and flush (to make the writes visible to the GPU)
>
>  - flush before starting GPU accesses, and then read
>
> At no point can "flush and read" be right.
>
> Now, I haven't actually seen your code, so I'm just going by your
> high-level description of where the CPU flush and CPU read were done,
> but it *sounds* like you did that invalid "flush and read" behavior.

Since barriers are on my mind: how strong a barrier is needed to
prevent cache fills from being speculated across the barrier?

Concretely, if I do:

  clflush A
  clflush B
  load A
  load B

the architecture guarantees that (unless store forwarding happens) the
value I see for B is at least as new as the value I see for A *with
respect to other accesses within the coherency domain*.  But the GPU
isn't in the coherency domain at all.

Is:

  clflush A
  clflush B
  load A
  MFENCE
  load B

good enough?  If it is, and if:

  clflush A
  clflush B
  load A
  LOCK whatever
  load B

is *not*, then this might account for the performance difference.

In any event, it seems to me that what i915 is trying to do isn't
really intended to be supported for WB memory.  i915 really wants to
force a read from main memory and to simultaneously prevent the CPU
from writing back to main memory.  Ick.

I'd assume that:

  clflush A
  clflush B
  load A
  serializing instruction here
  load B

is good enough, as long as you make sure that the GPU does its writes
after the clflushes make it all the way out to main memory (which might
require a second serializing instruction in the case of clflushopt),
but this still relies on the hardware prefetcher not prefetching B too
early, which it's permitted to do even in the absence of any explicit
access at all.

Presumably this is good enough on any implementation:

  clflush A
  clflush B
  load A
  clflush B
  load B

But that will be really, really slow.  And you're still screwed if the
hardware is permitted to arbitrarily change cache lines from S to M.

In other words, I'm not really convinced that x86 was ever intended to
have well-defined behavior if something outside the coherency domain
writes to a page of memory while that page is mapped WB.
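For the sake of discussion, here's roughly the ordering I have in mind
written out in C with the compiler intrinsics instead of raw asm.  This
is only a sketch of the ordering question, not the actual i915 code:
gpu_write_buffer() and gpu_wait_idle() are made-up stand-ins for
however the driver kicks off and waits for the GPU, and I'm assuming
64-byte lines, a line-aligned buffer, and -mclflushopt:

#include <immintrin.h>  /* _mm_clflushopt, _mm_mfence; needs -mclflushopt */
#include <stddef.h>
#include <stdint.h>

/* Made-up stand-ins for "kick the GPU" and "wait until it's done". */
static void gpu_write_buffer(void *buf, size_t len) { (void)buf; (void)len; }
static void gpu_wait_idle(void) { }

/* Assumes buf is 64-byte aligned and len is a multiple of 64. */
static uint64_t read_back_gpu_data(uint8_t *buf, size_t len)
{
    /*
     * Flush *before* the GPU writes, so that a dirty CPU line can't
     * get written back on top of the GPU's data later.
     */
    for (size_t i = 0; i < len; i += 64)
        _mm_clflushopt(buf + i);

    /*
     * clflushopt is weakly ordered, so fence before letting the GPU
     * touch the buffer.
     */
    _mm_mfence();

    gpu_write_buffer(buf, len);
    gpu_wait_idle();

    /*
     * Now read (touch one byte per line).  The open question above:
     * what, if anything, keeps the prefetcher from having pulled these
     * lines back into the cache before the GPU write actually landed?
     */
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += 64)
        sum += buf[i];
    return sum;
}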
Of course, I'm also not sure how to reliably switch a page from WB to
any other memory type short of remapping it and doing CLFLUSH after
remapping.  SDM Volume 3 11.12.4 seems to agree with me.

Could the driver be changed to use WC or UC and to use MOVNTDQA on
supported CPUs to get the performance back?  It sounds like i915 is
effectively doing PIO here, and reasonably modern CPUs have a nice set
of fast PIO instructions.

--Andy
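P.S.  To make the MOVNTDQA idea concrete, the copy-out path could look
something like this (only a sketch, not driver code: it assumes SSE4.1,
that src is a 16-byte-aligned WC mapping, that dst is ordinary
16-byte-aligned memory, that len is a multiple of 16, and that
copy_from_wc() is just a made-up name):

#include <immintrin.h>  /* _mm_stream_load_si128; needs -msse4.1 */
#include <stddef.h>

static void copy_from_wc(void *dst, const void *src, size_t len)
{
    __m128i *d = dst;
    const __m128i *s = src;

    for (size_t i = 0; i < len / sizeof(__m128i); i++) {
        /*
         * Non-temporal load: on WC memory this pulls the data through
         * a streaming-load buffer instead of polluting the cache.
         */
        __m128i tmp = _mm_stream_load_si128((__m128i *)&s[i]);
        _mm_store_si128(&d[i], tmp);
    }
}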