On Sat, Apr 13, 2013 at 11:04:47PM -0700, Junio C Hamano wrote: > From: Jeff King <peff@xxxxxxxx> > > The "indegree" field in the commit object is only used while sorting > a list of commits in topological order, and wasting memory otherwise. > > We would prefer to shrink the size of individual commit objects, > which we may have to hold thousands of in-core. We could eject > "indegree" field out from the commit object and represent it as a > dynamic table based on the decoration infrastructure, but the > decoration is meant for sparse annotation and is not a good match. > > Instead, let's try a different approach. > > - Assign an integer (commit->index) to each commit we keep in-core > (reuse the space of "indegree" field for it); > > - When running the topological sort, allocate an array of integers > in bulk (called "slab"), use the commit->index as an index into > this array, and store the "indegree" information there. > > This does _not_ reduce the memory footprint of a commit object, but > the commit->index can be used as the index to dynamically associate > commits with other kinds of information as needed. > > Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx> Thanks, this writeup makes sense to me. Obviously, Signed-off-by: Jeff King <peff@xxxxxxxx> I had planned to clean it up into a more general form, as you did in patch 3, before submitting it, but doing it like this with the cleanup on top is fine with me (and it makes the attribution of the work much clearer). The rest of this email ended up long and rambly, and is kind of a dump of the pending thoughts I had on the slab approach. Don't take it as "do not do this patch", but rather "here are some areas to explore on top". One thing you did gloss over a bit is "the decoration is meant for sparse annotation". I do wonder what the performance would look like for implementing --topo-order fully via "struct decoration". Given that our hash function is a direct memory access of object->sha1, and that with a reasonable load factor in the table, most lookups should not hit any collision, it is not too expensive to do a lookup. I think the biggest cost would be that we have to do a full hashcmp() for each access to actually _check_ the collision. So it probably is noticeably slower due to that, but I didn't actually measure it. Assuming the slab technique is faster, I suspect there are some "struct decoration" cases that should be switched. Unfortunately the most obvious one to me would be the mark storage in fast-export, but that is used for object types other than commit. And I don't think we want to move the slab index into "struct object", for two reasons: 1. It bloats the size of blob and tree objects, which we have more of than commits (especially in something like "rev-list --objects --all"). 2. Your commit indices become less dense, so you have to use a bigger slab, which is wasteful and will have worse memory cache performance. We could maybe mitigate that by keeping a per-type index counter, but that complicates the slab lookup (it would have to dereference based on type). I also note that we still have commit->util, which serves a similar purpose to the slab index, and is often used when we are attaching data to a lot of commits (and in fact, --topo-sort could probably just be implemented via commit->util). Comparing a slab index versus a util pointer, I think the differences should be: 1. with slab, you have an extra level of indirection ("x + c->index" rather than "c->util"), which might be measurably slower. 2. with slab, you do not have to worry about conflicting with other users of "struct commit" in the same program 3. with util, you can squeeze values smaller than a pointer into "struct commit"; technically you can do this with the slab index, too, but that would negate (2) above. However, I don't think we even use "util" that way anywhere currently. 4. util pointers can point multiple commits to the same chunk of memory. A slab index can do that, too, but you have an extra level of indirection (you store a slab of pointers to items, some of which are shared). 5. with a util pointer, we typically point to an malloc'd chunk of memory. With a slab, you avoid per-commit allocation in favor of allocating a big chunk of slab at once. However, you should be able to do the same thing with a util pointer (i.e., allocate from a pool and just point into it). So I think the util pointer is strictly better, except for point (2). And the main cost is point (1), the extra indirection. So if we can measure (1) and decide it isn't a big cost, the maintainability improvement for (2) is probably worth it, and we can rip out the util pointer entirely, saving us some extra space in "struct commit". -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html