Re: freeing memory in a shared library affects the performance of a program that uses it

Alberto Gcchelp via Gcc-help <gcc-help@xxxxxxxxxxx> · Tue, 25 May 2021 10:29:22 +0200

> If the time is where you indicated, while the slowdown is present, then
> the likely cause is CPU cache misses.  Those cache misses could be caused
> by reuse of the fragmented memory freed by that line in the library (vs.
> if those many fragments were not freed, the subsequent allocations would
> take a contiguous chunk of additional address space, which might be more
> cache friendly).

I believe that diagnosis may explain what I'm observing.

I profiled several testcases and, as before, most of the runtime is spent
in the matrix-vector products.

If I call gmsh::finalize, the matrix-vector products take up to twice as
long as when I don't. Other parts of the program aren't significantly
affected.

There are no allocations or deallocations in those matrix-vector products.
The instructions involved should be approximately those that I pasted
below. IDVecVec is a std::vector<std::vector<std::size_t>> containing the
indices of each cell's neighbour cells.
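
For readers who prefer source to assembly, the inner loop corresponds
roughly to the C++ sketch below. It is a simplified reconstruction, not the
actual code: the names neighbour_product, coeffs, x and y are hypothetical,
the coefficient storage is assumed to be a nested vector, and the extra
per-row term visible in the assembly is left out. The point is only that
each row reads its neighbour indices from a separately allocated inner
vector and then gathers from x, with no allocation inside the loops.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: `neighbours` plays the role of IDVecVec, `coeffs`
// holds one coefficient per neighbour entry, `x` is the input vector and
// `y` is a pre-sized output vector (nothing is allocated in here).
void neighbour_product(
    const std::vector<std::vector<std::size_t>>& neighbours,
    const std::vector<std::vector<double>>& coeffs,
    const std::vector<double>& x,
    std::vector<double>& y)
{
    for (std::size_t i = 0; i < neighbours.size(); ++i) {
        double sum = 0.0;
        // Stop at the shorter of the two lists, like the two end-pointer
        // comparisons in the inner assembly loop.
        const std::size_t n =
            std::min(neighbours[i].size(), coeffs[i].size());
        for (std::size_t k = 0; k < n; ++k) {
            // Indirect (gather) access: the address read from x depends
            // on the index stored in the inner vector.
            sum += coeffs[i][k] * x[neighbours[i][k]];
        }
        y[i] = sum;
    }
}
```

Each neighbours[i] and coeffs[i] owns its own heap block, so where the
allocator places those blocks decides how far apart consecutive rows are
in memory.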

It's still surprising to me that freeing memory in a shared library, when
there is plenty of free RAM available (I forgot to mention that my
testcases consume very little memory), affects the performance of totally
unrelated code. Is there a remedy other than not calling gmsh::finalize?
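
One direction that might make the kernel less sensitive to heap layout,
if the cache-miss diagnosis is right, is to copy the neighbour lists into
one contiguous buffer (a CSR-style layout) once the mesh is built. The
sketch below is untested against the real program; FlatNeighbours and
flatten are hypothetical names, and whether it actually helps here is an
assumption that would need measuring.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container: every neighbour index lives in one buffer.
struct FlatNeighbours {
    std::vector<std::size_t> offsets;  // row i spans [offsets[i], offsets[i+1])
    std::vector<std::size_t> indices;  // concatenated neighbour indices
};

// Copy a nested neighbour list (such as IDVecVec) into the flat layout.
FlatNeighbours flatten(const std::vector<std::vector<std::size_t>>& nested)
{
    FlatNeighbours flat;

    // Count the entries first so that each buffer is allocated only once.
    std::size_t total = 0;
    for (const auto& row : nested) {
        total += row.size();
    }
    flat.offsets.reserve(nested.size() + 1);
    flat.indices.reserve(total);

    flat.offsets.push_back(0);
    for (const auto& row : nested) {
        flat.indices.insert(flat.indices.end(), row.begin(), row.end());
        flat.offsets.push_back(flat.indices.size());
    }
    return flat;
}
```

The product loop would then walk indices from offsets[i] to offsets[i+1]
for row i, so consecutive rows are adjacent in memory regardless of how
the allocator placed the original inner vectors.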

The good thing is that I should be able to prepare a more or less reduced
testcase for the Gmsh devs to test.

Thanks so much for your help!

        push    r15
        push    r14
        push    r13
        push    r12
        push    rbp
        push    rbx
        mov     rbp, QWORD PTR [rdi]
        test    rbp, rbp
        je      .L23
        mov     rax, QWORD PTR [rsi+24]
        mov     r14, QWORD PTR [rdi+8]
        mov     r11, QWORD PTR [rax]
        mov     rax, QWORD PTR [rsi+8]
        mov     r15, QWORD PTR [rsi+32]
        mov     r13, QWORD PTR [rax+8]
        mov     rax, QWORD PTR [rsi]
        mov     r10, QWORD PTR VF::TMalla<2ul>::IDVecVec[rip]
        mov     r12, QWORD PTR [rax+8]
        vmovsd  xmm3, QWORD PTR .LC1[rip]
        mov     rbx, rsi
        sal     rbp, 3
        xor     r9d, r9d
        vxorpd  xmm4, xmm4, xmm4
.L16:
        mov     rax, QWORD PTR [r11+8+r9*2]
        mov     rcx, QWORD PTR [r11+r9*2]
        mov     rdx, QWORD PTR [r10]
        lea     rdi, [rax+rcx*8]
        mov     rsi, QWORD PTR [r10+8]
        cmp     rax, rdi
        je      .L18
        cmp     rdx, rsi
        je      .L18
        mov     r8, QWORD PTR [r15+8]
        vmovsd  xmm0, xmm4, xmm4
.L14:
        mov     rcx, QWORD PTR [rdx]
        vmovsd  xmm5, QWORD PTR [rax]
        add     rdx, 8
        vfmadd231sd     xmm0, xmm5, QWORD PTR [r8+rcx*8]
        add     rax, 8
        cmp     rsi, rdx
        je      .L13
        cmp     rdi, rax
        jne     .L14
.L13:
        vmovsd  xmm1, QWORD PTR [r12+r9]
        vdivsd  xmm2, xmm3, QWORD PTR [rbx+16]
        vmulsd  xmm1, xmm1, QWORD PTR [r13+0+r9]
        add     r10, 24
        vfmadd132sd     xmm1, xmm0, xmm2
        vmovsd  QWORD PTR [r14+r9], xmm1
        add     r9, 8
        cmp     rbp, r9
        jne     .L16
.L23:
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret
.L18:
        vmovsd  xmm0, xmm4, xmm4
        jmp     .L13