Re: libgit2 - a true git library

Pierre Habouzit <madcoder@xxxxxxxxxx> · Sat, 01 Nov 2008 02:02:11 +0100

On Sat, Nov 01, 2008 at 12:19:26AM +0000, Shawn O. Pearce wrote:
> Pierre Habouzit <madcoder@xxxxxxxxxx> wrote:
> > On Fri, Oct 31, 2008 at 11:49:11PM +0000, Junio C Hamano wrote:
> > > 
> > > I understand that the apidocs/ is a very early work-in-progress, but
> > > still, it bothers me that it is unclear to me what lifetime rules are in
> > > effect on the in-core objects.  For example, in C-git, commit objects are
> > > not just parsed but are modified in place as history is traversed
> > > (e.g. their flags smudged and their parents simplified).  You have "flags"
> > > field in commit, which implies to me that the design shares this same
> > > "modified by processing in-place" assumption.
> > 
> > I don't think it's impossible to have something efficient without this
> > kind of hacks. You just need to dissociate the objects from their
> > annotations, though use some kind of allocator that allow numbering of
> > the objects, and use that number as a lookup in an array of annotations.
> > It will require pool allocators for the annotations, but that should
> > work fine and efficientely.
> 
> Interesting approach.  I don't know why I didn't think of that one.
> 
> You'll still need to be able to toss parts of the git graph though.
> If you just pin everything in memory under a single global object
> table you'll run server processes out of memory as they chug through
> large numbers of repositories.

Sure, but for that you just need to reinject the numbers into some kind
of free list (hint a bitmap) to reuse old slots. Of course this is some
kind of take-once never-release approach _BUT_ one can do better and
"defrag" this at times.

E.g. for a server if we take your idea, there are some times we probably
*know* nobody has kept a reference to one of the pointers and we can
reorganize some pointers around to free chunks of data not allocated.

An escape way is to use mmap + madvise for those pools, the former to
allocate the memory and the latter to drop large unused ranges when
needed (remapping with MAP_FIXED is also supposed to work but fails on
Macos it seems).  Even win32 has what it takes to do so (just skim
through the code of jemalloc, that is used for mozilla, it's quite
portable on POSIX + Windows).

I was thinking that one should create and register pools of annotations
as such, define the size of the annotation (IOW the size of cell), and
let deal with that. I imagine the stuff as some kind of allocator that
would allocate the first time e.g. 4k of objects, and if you need more
8k, and if need more 16k and so on exponentially. Mapping an integer to
a cell in this kind of ropes is _really_ efficient: you need to know the
first bit set (__builtin_clz is of help with gcc, it can be emulated
with proper asm for many non-gcc platforms also) to give you the number
of the rope component that you intend to address, you clear that bit,
the resulting number is the index in that rope component of the cell
you're interested in. On most machines such an operation would be a few
cycles, which should be fairly little wrt the operation you would do
during a traversal.

Maintaining the bitmap is important, so that we can give back large
chunks of physical memory back to the system, and many malloc
implementations would likely use such kind of implementations, we would
just do our (which is arguably heavy) but gives us control on the cell
number, which is just what we need.

Of _course_ when we have a better / more natural place to put
annotations we need, let's do it ! But the kind of thing I just imagined
is more generic and can help to do arbitrary complex stuff during a
traversal.

We may want to explore a storage that is aware of the objects type also,
as I expect the liveness of objects to depend on the objects type a lot,
and some traversal to not care about some kind of objects at all, which
when combined with lazy allocation in the annotations pools, could end
up with reduced memory footprint. We could e.g. use two bits of the
"object handler" to store the type and dispatch into 3 (commit, tree,
blob) ropes instead of a big one.

The object and annotations store _is_ what makes git efficient, and is
the very service the library will have to support well. We _will_ have
to put some clever code in there and we cannot sacrifice any kind of
performance, this will be the tight loop every time.  I don't think
there will be anything else in git that is so central for performance.

On the other hand, it's probably ok for the first versions of the
library to have some things in it that don't perform well in long lived
processes (repacking e.g., if we ever want to do that in the library, as
it's slow and that forking a git-repack away should not be a too
expensive cost anyway).
-- 
·O·  Pierre Habouzit
··O                                                madcoder@xxxxxxxxxx
OOO                                                http://www.madism.org
Attachment:
pgpbXsigg4Z3a.pgp

Description: PGP signature