On Fri, Dec 20, 2019 at 03:36:51PM +0000, Christopher Lameter wrote:
> On Fri, 20 Dec 2019, Tejun Heo wrote:
>
> > On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
> > > > So, my question to the uarch/percpu folks out there: Why are percpu
> > > > accesses (%gs segment register) more expensive than regular global
> > > > variables in this scenario.
> > >
> > > I'm also VERY interested in knowing the answer to above question!?
> > > (Adding LKML to reach more people)
> >
> > No idea. One difference is that percpu accesses are through vmap area
> > which is mapped using 4k pages while global variable would be accessed
> > through the fault linear mapping. Maybe you're getting hit by tlb
> > pressure?

bpf_redirect_info is static so that should be accessed via the linear
mapping as well if we're embedding the first chunk.

>
> And there are some accesses from remote processors to per cpu ares of
> other cpus. If those are in the same cacheline then those will cause
> additional latencies.
>

I guess we could pad out certain structs like bpf_redirect_info, but
that isn't really ideal.
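
To make the padding idea concrete, here is a rough userspace sketch, not
a patch. The struct below is a hypothetical stand-in (it is not the real
bpf_redirect_info layout), the 64-byte line size is an assumption for
typical x86-64 parts, and same_cacheline() is a helper made up for the
illustration. In the kernel, the equivalent would presumably be
something along the lines of ____cacheline_aligned or
DEFINE_PER_CPU_ALIGNED rather than C11 alignas.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64	/* assumption: typical x86-64 line size */
#define NR_SLOTS 4		/* hypothetical number of per-cpu slots */

/* Hypothetical stand-in, NOT the real struct bpf_redirect_info. */
struct redirect_info {
	unsigned int flags;
	unsigned int tgt_index;
	void *tgt_value;
};

/*
 * Same fields, but the first member is aligned to a cacheline. That
 * raises the alignment of the whole struct, so sizeof() is rounded up
 * to a multiple of CACHELINE_SIZE and array elements never share a line.
 */
struct redirect_info_padded {
	alignas(CACHELINE_SIZE) unsigned int flags;
	unsigned int tgt_index;
	void *tgt_value;
};

/* Compile-time checks that the padding does what the comment claims. */
_Static_assert(alignof(struct redirect_info_padded) == CACHELINE_SIZE,
	       "padded struct must be cacheline aligned");
_Static_assert(sizeof(struct redirect_info_padded) % CACHELINE_SIZE == 0,
	       "padded struct must fill whole cachelines");

/* One slot per (simulated) cpu; adjacent unpadded slots can share a line. */
static struct redirect_info unpadded_slots[NR_SLOTS];
static struct redirect_info_padded padded_slots[NR_SLOTS];

/* Return 1 if both addresses fall into the same cacheline, else 0. */
static inline int same_cacheline(const void *a, const void *b)
{
	return (uintptr_t)a / CACHELINE_SIZE == (uintptr_t)b / CACHELINE_SIZE;
}

/* Bytes lost to padding per slot, i.e. the cost of this approach. */
static inline size_t padding_overhead(void)
{
	return sizeof(struct redirect_info_padded) -
	       sizeof(struct redirect_info);
}
```

With this layout same_cacheline(&padded_slots[0], &padded_slots[1]) is
always false, whereas neighbouring entries of unpadded_slots may well
land in one line. The price is padding_overhead() bytes per slot, which
is part of why padding every such struct isn't really ideal.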