On Sat, Nov 01, 2008 at 12:19:26AM +0000, Shawn O. Pearce wrote: > Pierre Habouzit <madcoder@xxxxxxxxxx> wrote: > > On Fri, Oct 31, 2008 at 11:49:11PM +0000, Junio C Hamano wrote: > > > > > > I understand that the apidocs/ is a very early work-in-progress, but > > > still, it bothers me that it is unclear to me what lifetime rules are in > > > effect on the in-core objects. For example, in C-git, commit objects are > > > not just parsed but are modified in place as history is traversed > > > (e.g. their flags smudged and their parents simplified). You have "flags" > > > field in commit, which implies to me that the design shares this same > > > "modified by processing in-place" assumption. > > > > I don't think it's impossible to have something efficient without this > > kind of hacks. You just need to dissociate the objects from their > > annotations, though use some kind of allocator that allow numbering of > > the objects, and use that number as a lookup in an array of annotations. > > It will require pool allocators for the annotations, but that should > > work fine and efficientely. > > Interesting approach. I don't know why I didn't think of that one. > > You'll still need to be able to toss parts of the git graph though. > If you just pin everything in memory under a single global object > table you'll run server processes out of memory as they chug through > large numbers of repositories. Sure, but for that you just need to reinject the numbers into some kind of free list (hint a bitmap) to reuse old slots. Of course this is some kind of take-once never-release approach _BUT_ one can do better and "defrag" this at times. E.g. for a server if we take your idea, there are some times we probably *know* nobody has kept a reference to one of the pointers and we can reorganize some pointers around to free chunks of data not allocated. An escape way is to use mmap + madvise for those pools, the former to allocate the memory and the latter to drop large unused ranges when needed (remapping with MAP_FIXED is also supposed to work but fails on Macos it seems). Even win32 has what it takes to do so (just skim through the code of jemalloc, that is used for mozilla, it's quite portable on POSIX + Windows). I was thinking that one should create and register pools of annotations as such, define the size of the annotation (IOW the size of cell), and let deal with that. I imagine the stuff as some kind of allocator that would allocate the first time e.g. 4k of objects, and if you need more 8k, and if need more 16k and so on exponentially. Mapping an integer to a cell in this kind of ropes is _really_ efficient: you need to know the first bit set (__builtin_clz is of help with gcc, it can be emulated with proper asm for many non-gcc platforms also) to give you the number of the rope component that you intend to address, you clear that bit, the resulting number is the index in that rope component of the cell you're interested in. On most machines such an operation would be a few cycles, which should be fairly little wrt the operation you would do during a traversal. Maintaining the bitmap is important, so that we can give back large chunks of physical memory back to the system, and many malloc implementations would likely use such kind of implementations, we would just do our (which is arguably heavy) but gives us control on the cell number, which is just what we need. Of _course_ when we have a better / more natural place to put annotations we need, let's do it ! But the kind of thing I just imagined is more generic and can help to do arbitrary complex stuff during a traversal. We may want to explore a storage that is aware of the objects type also, as I expect the liveness of objects to depend on the objects type a lot, and some traversal to not care about some kind of objects at all, which when combined with lazy allocation in the annotations pools, could end up with reduced memory footprint. We could e.g. use two bits of the "object handler" to store the type and dispatch into 3 (commit, tree, blob) ropes instead of a big one. The object and annotations store _is_ what makes git efficient, and is the very service the library will have to support well. We _will_ have to put some clever code in there and we cannot sacrifice any kind of performance, this will be the tight loop every time. I don't think there will be anything else in git that is so central for performance. On the other hand, it's probably ok for the first versions of the library to have some things in it that don't perform well in long lived processes (repacking e.g., if we ever want to do that in the library, as it's slow and that forking a git-repack away should not be a too expensive cost anyway). -- ·O· Pierre Habouzit ··O madcoder@xxxxxxxxxx OOO http://www.madism.org
Attachment:
pgpbXsigg4Z3a.pgp
Description: PGP signature