On Sun, Mar 12, 2017 at 01:21:12PM +0000, Chris Wilson wrote:
> On Fri, Mar 10, 2017 at 05:14:32PM -0800, Kenneth Graunke wrote:
> > On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> > had the surprising behavior of doing a synchronized GTT mapping.
> > This is obviously not what the user of the API wanted.
> >
> > Eric left a comment indicating a valid concern: if the CPU and GPU
> > caches are incoherent, we don't keep track of where the user last
> > mapped the buffer, and what caches might contain relevant data.
>
> Note this is an issue in libdrm_intel not tracking the cache domain
> transitions. Even just a switch between cpu and coherent would solve the
> majority of that - the caveat being shared bo where the tracking is
> incomplete.
>
> > Modern Atom systems still don't have LLC, but they do offer snooping,
> > which effectively makes the caches coherent. The kernel appears to
> > set up the PTE/PPAT to enable snooping for everything where the cache
> > level is not I915_CACHE_NONE. As far as I know, only scanout buffers
> > are marked as uncached.
>
> Byt, bsw beg to differ. I don't have a bxt to know the results of the
> igt/kernel tests.

Just give me a list of the tests to run (and, if any, what patches to
apply and the debugging level you want enabled) and I'll provide the
necessary results.

> > Any buffers used by scanout should be flagged as non-reusable with
> > drm_intel_bo_disable_reuse(), prime export, or flink. So, we can
> > assume that any reusable buffer should be snooped.
>
> Not really, there is no reason why scanout buffers can't be reused.
>
> > This patch enables unsynchronized mappings for reusable buffers
> > on all Gen6+ hardware (which have either LLC or snooping).
> >
> > On Broxton, this improves the performance of Unigine Valley 1.0
> > on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> > (same settings) by about 53%.
>
> Does anyone have figures for gtt performance on bxt - does it cover over
> the same performance penalty from earlier atoms? Basically why bother to
> enable this over wc mapping (no stalls for a contended, limited
> resource) + detiling. (Just note that for detiling Y to WC you need to
> use a temporary cacheable page, or rearrange the code to make sure the
> reads/writes are in 64 byte chunks.)

Again, I can run any tests you'd like to get numbers from; just give me
a list.

Kind regards,
David