On Mon, Oct 5, 2015 at 5:23 PM, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> wrote:
>
> One thing I've been noticing on Skylake is that barriers (implicit and
> explicit) are showing up more in profiles.

Ahh, you're on skylake?

It's entirely possible that the issue is that the whole "stac/mov/clac"
is much more expensive because skylake actually ends up supporting
those AC instructions.

That would make sense. We could probably do them outside the loop,
rather than tightly around the actual move instructions.

Peter (hpa), is there some sane interface to try to do that?

> What we're seeing here
> probably isn't actually stac/clac overhead, but the cost of finishing
> some other operations that are outstanding before we can proceed through
> here.

I suspect it actually _is_ stac/clac overhead. It might well be that
clac/stac ends up serializing loads some way. Last I heard, they were
reasonably cheap but certainly not free - and when we're talking about
something that just loops over bringing the line into cache, it might
be relatively expensive.

How did you do the profile? Use "-e cycles:pp" to get the precise
profile information, which should actually attribute the cost to the
instruction that really causes it.

               Linus
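
For illustration only, here is a rough user-space sketch of the "do them
outside the loop" idea from the message above. It is not kernel code and
not an existing interface: fake_stac(), fake_clac(), copy_bracket_each()
and copy_bracket_once() are hypothetical names, and the two stand-ins only
count how often the real (privileged) stac/clac instructions would have
run. It says nothing about what those instructions actually cost on any
CPU.

```c
#include <stddef.h>

/*
 * Hypothetical stand-ins for the privileged STAC/CLAC instructions.
 * They only count invocations; they do not touch EFLAGS.AC.
 */
static unsigned long stac_calls, clac_calls;

static void fake_stac(void) { stac_calls++; }
static void fake_clac(void) { clac_calls++; }

/* The "stac/mov/clac" pattern: the pair wraps every single access. */
static void copy_bracket_each(unsigned long *dst, const unsigned long *src,
			      size_t n)
{
	for (size_t i = 0; i < n; i++) {
		fake_stac();
		dst[i] = src[i];
		fake_clac();
	}
}

/* The hoisted variant: one pair around the whole loop. */
static void copy_bracket_once(unsigned long *dst, const unsigned long *src,
			      size_t n)
{
	fake_stac();
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];
	fake_clac();
}
```

For n elements the first variant runs the pair n times and the second
runs it once, which is the whole point of hoisting. The tradeoff in the
sketch is that the access window stays open for the full loop, so nothing
that should not run inside that window may be called from the loop body.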