If the experts here tell us this is too far off topic for the mailing list, I'm willing to help more via direct email and/or try to join the other forum where you asked.

> There are no allocations or deallocations in those matrix-vector products.

You do seem to have confirmed the theory that the problem is a CPU cache issue, and ruled out the theory that the problem is the return of a flaw (fixed long ago) in the GNU code for managing the pool of free chunks.

> The instructions involved should be approximately those that I pasted below.

While the asm version of the code is relevant in diagnosing many performance issues, for a CPU cache issue the asm is a distraction, and the C++ code would allow a more informative discussion.

> There are no allocations or deallocations in those matrix-vector products.

The allocations you do after calling gmsh::finalize are not directly slow, but (like gmsh::finalize itself) they can still be the cause of the later slowdown.

> The good thing is that I should be able to prepare a more or less reduced testcase for the Gmsh devs to test.

I still think that is asking a bit much of them, unless you could somehow make the case that a memory-fragmentation-driven cache issue in subsequent code is likely to be occurring in common use of their library and is simply not being measured by other users. I expect reducing the memory fragmentation that their code causes would be a big effort, only justified if they believed many users would benefit.

> It's still surprising to me that freeing memory in a shared library, when there is plenty of free RAM available (forgot to mention that my testcases consume very little memory), affects the performance of a totally unrelated code. Is there a remedy other than not calling gmsh::finalize?

There is certainly a remedy other than not freeing that memory. It probably isn't even necessary to understand the mechanism of the CPU cache problem.
What you do need to understand is the allocation of the objects that are heavily accessed by the code that you have identified (by profile) as slow. The symptoms imply that at least some of those objects were allocated after the call to gmsh::finalize, and changing the nature or allocation of those objects is the key to speeding up your code. Again, I could be much less vague if I saw the C++ code (the slow function, its caller, and so on up to the level of the real owner of the data involved).

In the gmsh code, a vector of pointers to objects was used in places where a vector of objects would be simpler and more cache friendly. The comments indicated that the reason was that the objects are polymorphic, so a vector of objects would not work. Are you similarly using a vector of pointers? If so, do you have a similarly valid reason for using that instead of a vector of objects?

If there is a good reason for that pointer layer, there are still ways to fix the cache issue. But if a pointer layer serves no real purpose, then eliminating it may be the simplest solution to the cache issue. Again, if I saw the relevant code, I could be less abstract about the changes I'm suggesting.
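
To make the vector-of-pointers question a little less abstract, here is a minimal sketch of the two layouts. I have not seen your code, so everything in it is an assumption: Entry, sum_via_pointers, sum_via_objects and flatten are hypothetical names, and the loop is only a stand-in for whatever your matrix-vector product really does.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical element type; a stand-in for whatever the slow loop reads.
struct Entry {
    double value;
    std::size_t column;
};

// Pointer layer: every Entry is a separate heap allocation, so consecutive
// elements may be placed wherever the allocator happens to have free chunks.
double sum_via_pointers(const std::vector<std::unique_ptr<Entry>>& entries,
                        const std::vector<double>& x) {
    double acc = 0.0;
    for (const auto& e : entries) {
        assert(e->column < x.size());
        acc += e->value * x[e->column];
    }
    return acc;
}

// No pointer layer: all Entry objects live in one contiguous block, so the
// loop walks memory sequentially no matter how fragmented the heap is.
double sum_via_objects(const std::vector<Entry>& entries,
                       const std::vector<double>& x) {
    double acc = 0.0;
    for (const Entry& e : entries) {
        assert(e.column < x.size());
        acc += e.value * x[e.column];
    }
    return acc;
}

// Copies the pointed-to objects into contiguous storage. Only valid when the
// objects are not polymorphic (copying through a base type would slice them).
std::vector<Entry> flatten(const std::vector<std::unique_ptr<Entry>>& entries) {
    std::vector<Entry> out;
    out.reserve(entries.size());
    for (const auto& e : entries) {
        out.push_back(*e);
    }
    return out;
}
```

Both functions compute the same result; the only difference is where the Entry objects live. If your objects really are polymorphic, flatten as written is not an option, and the fix would have to control where the individual objects are allocated instead.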