> Hi,
> yes, that seems a bit weird. It is a bit better as it does not modify
> common code, but still... Maybe going back to your original idea of
> replacing memcpy, try replacing it with readq? It should generate a
> single-instruction read (although that is only for x86_64; for a 32-bit
> kernel we would still need to do something else).
>
> Thanks,
> Amadeusz

Hi,

I've compared the assembly to see if there is a clue. Both kernels use a
64-bit mov to read the register; the only difference is whether the code is
optimized. Both implementations look correct to me, so I currently have no
answer for why the slower kernel hits the problem while the optimized one
survives.

1. Old kernel. The code is optimized and I am not able to reproduce the
issue on this kernel.

(gdb) disas sst_shim32_read64
Dump of assembler code for function sst_shim32_read64:
   0x000000000000096c <+0>:   call   0x971 <sst_shim32_read64+5>   => call __fentry__
   0x0000000000000971 <+5>:   push   rbp
   0x0000000000000972 <+6>:   mov    rbp,rsp
   0x0000000000000975 <+9>:   mov    eax,esi
   0x0000000000000977 <+11>:  mov    rax,QWORD PTR [rdi+rax*1]     => perform the 64-bit mov
   0x000000000000097b <+15>:  pop    rbp
   0x000000000000097c <+16>:  ret
End of assembler dump.

2. New kernel. Optimization is disabled, so the read is done through a call
to memcpy.
(gdb) disas sst_shim32_read64
Dump of assembler code for function sst_shim32_read64:
   0x00000000000009a8 <+0>:   call   0x9ad <sst_shim32_read64+5>   => call __fentry__
   0x00000000000009ad <+5>:   push   rbp
   0x00000000000009ae <+6>:   mov    rbp,rsp
   0x00000000000009b1 <+9>:   push   rbx
   0x00000000000009b2 <+10>:  sub    rsp,0x10
   0x00000000000009b6 <+14>:  mov    rax,QWORD PTR gs:0x28
   0x00000000000009bf <+23>:  mov    QWORD PTR [rbp-0x10],rax
   0x00000000000009c3 <+27>:  movabs rax,0xaaaaaaaaaaaaaaaa
   0x00000000000009cd <+37>:  lea    rbx,[rbp-0x18]
   0x00000000000009d1 <+41>:  mov    QWORD PTR [rbx],rax
   0x00000000000009d4 <+44>:  mov    esi,esi
   0x00000000000009d6 <+46>:  add    rsi,rdi
   0x00000000000009d9 <+49>:  mov    edx,0x8
   0x00000000000009de <+54>:  mov    rdi,rbx
   0x00000000000009e1 <+57>:  call   0x9e6 <sst_shim32_read64+62>  => call memcpy

The memcpy is implemented in arch/x86/lib/memcpy_64.S:

(gdb) disas memcpy
Dump of assembler code for function memcpy:
   0xffffffff813519c0 <+0>:   jmp    0xffffffff813519f0 <memcpy_orig>   => jump to memcpy_orig

X86_FEATURE_ERMS is disabled, so it jumps to memcpy_orig:

(gdb) disas memcpy_orig
Dump of assembler code for function memcpy_orig:
   0xffffffff813519f0 <+0>:    mov    rax,rdi
   0xffffffff813519f3 <+3>:    cmp    rdx,0x20
   0xffffffff813519f7 <+7>:    jb     0xffffffff81351a77 <memcpy_orig+135>   => taken because our read size is 8
   ...
   0xffffffff81351a77 <+135>:  cmp    edx,0x10
   0xffffffff81351a7a <+138>:  jb     0xffffffff81351aa0 <memcpy_orig+176>   => taken because our read size is 8
   ...
   0xffffffff81351aa0 <+176>:  cmp    edx,0x8
   0xffffffff81351aa3 <+179>:  jb     0xffffffff81351ac0 <memcpy_orig+208>
   0xffffffff81351aa5 <+181>:  mov    r8,QWORD PTR [rsi]
   0xffffffff81351aa8 <+184>:  mov    r9,QWORD PTR [rsi+rdx*1-0x8]
   0xffffffff81351aad <+189>:  mov    QWORD PTR [rdi],r8
   0xffffffff81351ab0 <+192>:  mov    QWORD PTR [rdi+rdx*1-0x8],r9   => performs the 64-bit mov twice over the same address (rdx=0x8)
   0xffffffff81351ab5 <+197>:  ret

Regards,
Brent