I looked into this further. It seems that libat_load_16_i1 implements the
16B load as "lock cmpxchg16b (%rdi)". This assumes that the CPU doesn't
support 16B loads in a single transaction.

How can I compile libatomic to use intrinsics for the 16B load instead of
lock cmpxchg16b?

I appreciate your response.

Satish

On Wed, Feb 23, 2022 at 8:42 AM Satish Vasudeva <satish.vasudeva@xxxxxxxxxxxx> wrote:

> Hi Team,
>
> I was looking at the hotspots in our software stack and, interestingly,
> libat_load_16_i1 seems to be one of the top entries in the list.
>
> I am trying to understand why that is the case. My suspicion is some kind
> of lock usage for 16B atomic accesses.
>
> I came across this discussion but frankly I am still confused.
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-01/msg02344.html
>
> Do you think the overhead of libat_load_16_i1 is due to spinlock usage?
> Also, reading some other Intel CPU docs, it seems like the CPU does support
> loading 16B in a single access. In that case, can we optimize this for
> performance?
>
> Thanks and appreciate your help.
>
> Satish
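
To make the question concrete, here is a minimal sketch of the kind of 16B load I have in mind in place of lock cmpxchg16b. It is not libatomic code. The names pair16 and load16_sse are made up for illustration, and it assumes that the CPU performs an aligned 16B SSE load as a single access, which is how I read the Intel docs. The compiler treats this as a plain load, so it carries none of the ordering guarantees of a real atomic load.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_si128, _mm_store_si128 */
#endif

/* Hypothetical 16-byte value type, aligned to 16 bytes (illustrative name). */
typedef struct {
    _Alignas(16) uint64_t lo;
    uint64_t hi;
} pair16;

/*
 * Hypothetical helper: read all 16 bytes with one aligned SSE load.
 * Assumption: the hardware performs this aligned load as a single access.
 * The C memory model knows nothing about that, so this is only a sketch
 * of the instruction choice, not a replacement for __atomic_load.
 */
static pair16 load16_sse(const pair16 *src)
{
    pair16 out;
#if defined(__SSE2__)
    /* Typically compiles to a single aligned 16-byte move (movdqa/movaps). */
    __m128i v = _mm_load_si128((const __m128i *)src);
    _mm_store_si128((__m128i *)&out, v);
#else
    /* Fallback for non-x86 targets so the sketch still builds; not atomic. */
    memcpy(&out, src, sizeof out);
#endif
    return out;
}
```

Unlike lock cmpxchg16b, a load like this does not write to the cache line, so it also works on read-only memory and does not take the line exclusive.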