Hi,

I have recently been running several programs (caffe, mxnet, openblas, blis...) on aarch64, and I found a performance regression when libgomp is used and OMP_NUM_THREADS is set to 2 or more. Almost half of the execution time is spent in either gomp_barrier_wait_end() or gomp_team_barrier_wait_end().

The version of libgomp I used is 5.3.1-14, which is shipped with Ubuntu 16.04.

Is this a known issue on aarch64, or might it be related to some other factor of the system or hardware?

Here are the perf statistics (hot spots) I collected from gomp_barrier_wait_end()/gomp_team_barrier_wait_end():

    84.90 │ add  x0, x0,
          │ cmp  x0, x2
          │ b.eq 11ee8 <omp_get_num_procs@@OMP_1.0+0x428>
    14.91 │ ldr  w1, [x19]

    76.59 │ add  x0, x0,
          │ cmp  x0, x2
          │ b.eq 121a0 <omp_get_num_procs@@OMP_1.0+0x6e0>
    23.05 │ ldr  w1, [x20]

Any ideas? Thanks in advance.

Baozi.