Thanks for the response. Looking further into the libatomic library code,
I see that 16-byte move instructions are already used in the
atomic_exchange code, as in the instruction below. I am wondering why
__atomic_load_16 is not implemented using this instruction. (A minimal
reproducer for the load side is sketched after the quoted message below.)

    movdqa 0x0(%rbp),%xmm0

On Thu, Feb 24, 2022 at 11:09 AM Xi Ruoyao <xry111@xxxxxxxxxxxxxxxx> wrote:

> On Wed, 2022-02-23 at 08:42 -0800, Satish Vasudeva via Gcc-help wrote:
> > Hi Team,
> >
> > I was looking at the hotspots in our software stack, and interestingly
> > libat_load_16_i1 is one of the top entries in the list.
> >
> > I am trying to understand why that is the case. My suspicion is some
> > kind of lock usage for 16-byte atomic accesses.
> >
> > I came across this discussion, but frankly I am still confused:
> > https://gcc.gnu.org/legacy-ml/gcc-patches/2017-01/msg02344.html
> >
> > Do you think the overhead of libat_load_16_i1 is due to spinlock usage?
> > Also, reading some other Intel CPU docs, it seems the CPU does support
> > loading 16 bytes in a single access. In that case, can we optimize this
> > for performance?
>
> Open an issue at https://gcc.gnu.org/bugzilla, with a reference to the
> Intel CPU doc proving that some specific models support 128-bit loads.
>
> Don't use "it seems like"; nobody wants to write some nasty SSE code and
> then find it doesn't work on any CPU.
> --
> Xi Ruoyao <xry111@xxxxxxxxxxxxxxxx>
> School of Aerospace Science and Technology, Xidian University
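
For completeness, here is a minimal sketch of the kind of 16-byte atomic
load in question (assuming GCC on x86-64, linked with -latomic; the exact
libatomic variant the call resolves to, e.g. libat_load_16_i1, is picked
at runtime via IFUNC):

    #include <stdio.h>

    /* __int128 is a GCC extension on x86-64; any 16-byte type behaves
       the same for this purpose.  A static object is 16-byte aligned
       per the ABI. */
    typedef unsigned __int128 u128;

    static u128 shared;   /* 16-byte object accessed atomically */

    int main(void)
    {
        /* GCC does not inline a 16-byte atomic load here; it emits a
           call to __atomic_load_16, which libatomic resolves to one of
           its runtime-selected implementations. */
        u128 snapshot = __atomic_load_n(&shared, __ATOMIC_SEQ_CST);

        printf("low 64 bits: %llu\n", (unsigned long long)snapshot);
        return 0;
    }

Compiling this with "gcc -O2 test.c -latomic" and disassembling with
objdump -d should show the call to __atomic_load_16 rather than an inline
movdqa, which is the behavior the question above is about.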