On Sat, 24 Aug 2024, Yunhui Cui wrote:

> Compared to directly fetching the per-CPU offset from memory (or cache),
> using the global pointer (gp) to store the per-CPU offset can save one
> memory access.

Yes! That is a step in the right direction. Is there something like
gp-relative addressing so that we can do loads and stores relative to gp
as well?

Are there atomics that can do a read-modify-write relative to gp? That
would get you per-CPU efficiency comparable to x86.

x86 can do relative addressing and the RMW in one instruction, which
allows one to drop the preempt enable/disable, since a single instruction
cannot be interrupted partway through.
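To make the comparison concrete, here is a rough userspace sketch in C of the two access patterns. It is only a model, not kernel code: `areas`, `percpu_offset`, `offset_reg`, `preempt_count` and all the function names are made up for illustration. `offset_reg` stands in for a register that already holds the per-CPU offset (gp in the patch, the %gs base on x86), and `preempt_count` stands in for the preempt disable/enable pair.

```c
#include <stddef.h>

#define NR_CPUS 4                        /* hypothetical CPU count */

struct percpu_area { long counter; };    /* one copy per CPU */

static struct percpu_area areas[NR_CPUS];
static ptrdiff_t percpu_offset[NR_CPUS]; /* offset table kept in memory */
static ptrdiff_t offset_reg;             /* stand-in for gp (or the x86 %gs base) */
static int preempt_count;                /* stand-in for preempt disable nesting */

/* Fill the offset table: CPU n's copy starts n structs past the base. */
static void setup_percpu(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        percpu_offset[cpu] = (ptrdiff_t)(cpu * sizeof(struct percpu_area));
}

/* Model a task landing on a CPU: the register now holds that CPU's offset. */
static void switch_to_cpu(int cpu)
{
    offset_reg = percpu_offset[cpu];
}

/* Base address plus per-CPU offset gives this CPU's counter. */
static long *counter_at(ptrdiff_t off)
{
    return (long *)((char *)areas + off +
                    offsetof(struct percpu_area, counter));
}

/*
 * Pattern 1: the offset is fetched from memory first. The fetch and the
 * RMW are separate steps, so they are bracketed by the preempt counter
 * to keep the task from migrating in between.
 */
static void counter_inc_table(int cpu)
{
    preempt_count++;                     /* models preempt_disable() */
    ptrdiff_t off = percpu_offset[cpu];  /* the extra memory access */
    *counter_at(off) += 1;               /* load, add, store */
    preempt_count--;                     /* models preempt_enable() */
}

/*
 * Pattern 2: the offset is already in a register. On x86 the address
 * calculation and the RMW fold into one instruction, which is why no
 * preempt bracket appears here. In this C model it is still several
 * steps, so it only illustrates the shape of the access.
 */
static void counter_inc_reg(void)
{
    *counter_at(offset_reg) += 1;
}
```

Calling `setup_percpu()`, then `counter_inc_table(2)`, then `switch_to_cpu(2)` and `counter_inc_reg()` bumps CPU 2's counter twice and leaves the other copies untouched. Whether the second pattern can really drop the preempt bracket on RISC-V depends on the answers to the questions above.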