Re: Easy way to try out alternate allocators?

tom fogal <tfogal@xxxxxxxxxxxxxx> · Fri, 16 Oct 2009 16:00:14 -0600

I'm moving this over from a thread I (incorrectly) started on the
libstdc++ mailing list.  Another reply there has already pointed me
towards:

  http://gcc.gnu.org/onlinedocs/libstdc++/manual/configure.html

and --enable-libstdcxx-allocator, which looks promising, but I have not
tried it yet.
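
For context, the per-container route that a global switch would replace
looks roughly like the sketch below.  `__gnu_cxx::__mt_alloc` and
`<ext/mt_allocator.h>` are libstdc++ extensions; the `fast_vec` wrapper
and `sum_first_n` are names made up for illustration.  A wrapper typedef
like this at least confines the allocator choice to one place, though
every vector declaration still has to be touched.

```cpp
#include <cstddef>
#include <vector>
#include <ext/mt_allocator.h>  // libstdc++ extension, not portable

// Hypothetical helper: C++98 has no template typedefs, so the
// vector-plus-allocator combination is wrapped in a struct.
template <typename T>
struct fast_vec {
    typedef std::vector<T, __gnu_cxx::__mt_alloc<T> > type;
};

// Illustration only: fill a vector that uses __mt_alloc and sum it.
double sum_first_n(std::size_t n)
{
    fast_vec<double>::type v;
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<double>(i));

    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}
```

Switching allocators then means editing `fast_vec` only, but whether
__mt_alloc actually helps this workload would still need measuring.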

Ed Smith-Rowland <3dw4rd@xxxxxxxxxxx> writes:
> tom fogal wrote:
> > I'm finding our app is slow where it shouldn't be, and some
> > profiling has revealed we're spending a significant amount of time
> > allocating memory.  Confusingly, oprofile tells me we're spending
> > 99% of our time in operator new, yet sysprof says the major
> > offender is vector's push_back.
> >
> > Is there any compile-time flag to enable a different allocator
> > globally?  I'm hoping to find a _GLIBCXX_DEBUG but for, say,
> > __mt_alloc.  Unfortunately there are a *lot* of vectors involved
> > (which is probably why it's slow ... copying between them), so
> > specifying an explicit allocator on each of them would be painful.
> >
> > Other debugging / performance suggestions welcome as well.  Thanks,
>
> Have you tried reserve to grab a bunch of memory beforehand?

To a limited extent.  The issue is that we've got a loop that calls
a method which generates a vector and passes that vector to another
submodule.  The submodule, in turn, decomposes the argument vector into
its own internal vector.  In pseudo-code:

  foreach s in something:
    generate_data(s)

  generate_data(s):
    foreach x in s:
      create a vector, `v'
      call_submodule(v)

  call_submodule(v):
    foreach elem in v:
      for i from 0 to 12:
        submodule_internal_vec.push_back(elem[i])

I have tried both reserve and resize at the lower levels -- in
call_submodule -- but they don't help enough.  Interestingly, when
switching to resize my time ends up in __uninitialized_copy_fill (or
maybe just __uninitialized_copy? or __uninitialized_fill?  I forget the
exact name), but it doesn't matter much: it still spends about the same
amount of time regardless of whether it's in push_back or __whatever.
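
A rough C++ sketch of the reserve variant in call_submodule follows.
The `Elem` type (12 doubles per element) and the global vector are
stand-ins invented for illustration, not the real code.  One thing
worth checking in the real code: as far as I can tell, reserve()
allocates exactly what it is asked for, so reserving "current size plus
this batch" on every call can force a reallocation per call.  The
sketch asks for at least double the current capacity to keep growth
geometric.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical element type: 12 doubles, matching the inner loop of
// the pseudo-code.
struct Elem {
    double c[12];
};

// Stand-in for the submodule's internal storage.
std::vector<double> submodule_internal_vec;

void call_submodule(const std::vector<Elem>& v)
{
    const std::size_t need = submodule_internal_vec.size() + v.size() * 12;
    if (need > submodule_internal_vec.capacity()) {
        // Reserve at least double the old capacity so that repeated
        // calls still grow geometrically rather than batch by batch.
        submodule_internal_vec.reserve(
            std::max(need, 2 * submodule_internal_vec.capacity()));
    }
    for (std::size_t e = 0; e < v.size(); ++e)
        for (std::size_t i = 0; i < 12; ++i)
            submodule_internal_vec.push_back(v[e].c[i]);
}
```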

Ideally, generate_data would append to a vector, and the bottom of the
stack would pass the whole vector to `submodule' *after* the loop,
instead of piecemeal.  I could guesstimate the overall size of the
vector pretty accurately there, and submodule could either grab the
pointer, or at least know the overall size and copy it in one shot.
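
The batched version described above might look like the following
sketch.  All names here (`generate_data`'s signature, `submodule_take`,
`process_all`) are hypothetical, and the generated values are dummies.
In C++98 the closest thing to "grabbing the pointer" is vector::swap,
which exchanges the buffers in constant time without copying elements.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the submodule's internal storage.
std::vector<double> submodule_internal_vec;

// Hypothetical generator: appends per_x dummy values for each of n_x
// items to `out` instead of building a fresh vector every time.
void generate_data(std::size_t n_x, std::size_t per_x,
                   std::vector<double>& out)
{
    for (std::size_t x = 0; x < n_x; ++x)
        for (std::size_t i = 0; i < per_x; ++i)
            out.push_back(static_cast<double>(x * per_x + i));
}

// The submodule takes over the whole batch in O(1) by swapping buffers.
void submodule_take(std::vector<double>& batch)
{
    submodule_internal_vec.swap(batch);
}

// Outer loop: one reserve up front, one hand-off after the loop.
std::size_t process_all(std::size_t n_s, std::size_t n_x, std::size_t per_x)
{
    std::vector<double> all;
    all.reserve(n_s * n_x * per_x);  // the "guesstimate" of the total size
    for (std::size_t s = 0; s < n_s; ++s)
        generate_data(n_x, per_x, all);
    submodule_take(all);
    return submodule_internal_vec.size();
}
```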

Unfortunately, making that transformation is involved.

> I'm surprised vector is slow as opposed to list or something since
> the former allocates memory in chunks.

I actually haven't tried a list or a deque; like you, I assume it will
be significantly more expensive (and it's certainly expensive in terms
of time to implement -- I might as well fix it, in the manner I've
outlined above).

> push_back could allocate memory to grow the vector as necessary but
> it should do so in blocks according to some heuristic.

I'm not sure what libstdc++'s heuristic is, but we're probably
straining it here. `something' is 128x80x90 elements here, and 128^3
would still be considered `really small'.  Multiply that by my guess of
24 elems 'foreach x in s', and multiply it again by my not-guess of 12
elements per `x'.

My guess is that most people are using std::vectors with much smaller
datasets, and any heuristics would be designed for such smaller
datasets.  That's entirely unverified.
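
One way to check the growth behaviour rather than guess is to count how
often capacity() changes while pushing back.  The sketch below does
that, and also multiplies out the element estimate from above; both
function names are made up for illustration.  I believe libstdc++
roughly doubles the capacity on each reallocation, in which case the
count stays small even for large vectors, but the code makes no
assumption about the exact factor.

```cpp
#include <cstddef>
#include <vector>

// Element count from the estimate above: 128x80x90 grid points, a
// guessed 24 items each, 12 elements per item.
unsigned long estimated_elements()
{
    return 128UL * 80UL * 90UL * 24UL * 12UL;  // 265,420,800
}

// Push back n doubles and count how many times the capacity changes,
// i.e. how many reallocations (and full copies) the vector performed.
std::size_t count_reallocations(std::size_t n)
{
    std::vector<double> v;
    std::size_t reallocs = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(0.0);
        if (v.capacity() != cap) {
            ++reallocs;
            cap = v.capacity();
        }
    }
    return reallocs;
}
```

If the count grows only logarithmically with n, the growth heuristic
itself is unlikely to be the problem, and the cost is more likely in
the many short-lived vectors and the copies between them.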

> Also, the constructor for your objects is run.  Objects are copied.
> Are there a lot of things inside your elements that need to be heap
> allocated?

Looks like no.  These are simple, not-quite-POD primitives that every
graphics programmer seems to develop independently: std::vectors of
Point, Vector, and double.

> If you have a good swap method for your elements things could be
> better,

Ahh, defining my own swap!  I'd forgotten about that; good advice.
Thanks!
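
For reference, the usual C++98 pattern is a member swap plus a free
swap in the type's own namespace.  The `gfx::Polyline` type below is
hypothetical; a custom swap only pays off for types that own heap
memory (here, a vector member), not for plain structs of doubles.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

namespace gfx {

// Hypothetical type that owns heap memory through a vector member.
class Polyline {
public:
    explicit Polyline(std::size_t n = 0, double fill = 0.0)
        : coords_(n, fill) {}

    // Member swap: exchanges the underlying buffers, no element copies.
    void swap(Polyline& other) { coords_.swap(other.coords_); }

    std::size_t size() const { return coords_.size(); }
    double at(std::size_t i) const { return coords_[i]; }

private:
    std::vector<double> coords_;
};

// Free swap in the same namespace, found by argument-dependent lookup.
inline void swap(Polyline& a, Polyline& b) { a.swap(b); }

}  // namespace gfx

// Generic code should call swap unqualified so the overload above is
// preferred, with std::swap as the fallback for other types.
void exchange(gfx::Polyline& a, gfx::Polyline& b)
{
    using std::swap;
    swap(a, b);
}
```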

> especially if you compile with -std=c++0x, where the elements are moved
> instead of copied for further savings.  C++0x is supposed to bring
> something called a scoped allocator.  I have to admit I'm new to that
> though.

Don't I wish.  Unfortunately C++0x is a long way out for some of us.  I
only just managed to convince people around here that we should use TR1
features; MS didn't even support them until something like a year ago.
If that's any indicator, I can start making the argument for C++0x in
2020... :(

Thanks for the discussion,

-tom