On Thu, Jan 25, 2018 at 2:06 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote: > Derrick Stolee <stolee@xxxxxxxxx> writes: > >> Add document specifying the binary format for packed graphs. This >> format allows for: >> >> * New versions. >> * New hash functions and hash lengths. >> * Optional extensions. >> >> Basic header information is followed by a binary table of contents >> into "chunks" that include: >> >> * An ordered list of commit object IDs. >> * A 256-entry fanout into that list of OIDs. >> * A list of metadata for the commits. >> * A list of "large edges" to enable octopus merges. >> >> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> >> --- >> Documentation/technical/graph-format.txt | 88 ++++++++++++++++++++++++++++++++ >> 1 file changed, 88 insertions(+) >> create mode 100644 Documentation/technical/graph-format.txt >> >> diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt >> new file mode 100644 >> index 0000000000..a15e1036d7 >> --- /dev/null >> +++ b/Documentation/technical/graph-format.txt >> @@ -0,0 +1,88 @@ >> +Git commit graph format >> +======================= > > Good that this is not saying "graph format" but is explicit that it > is about "commit". Do the same for the previous steps. Especially, > builtin/graph.c that does not have much to do with graph.c is not a > good way forward ;-) > > I do like the fact that later parents of octopus merges are moved > out of way to make the majority of records fixed length, but I am > not sure if the "up to two parents are recorded in line" is truly > the best arrangement. Aren't majority of commits single-parent, > thereby wasting 4 bytes almost always? git.git has ~37k non-merge commits and ~12k merge commits, (35 of them have 3 or more parents). So 75% would waste the 4 bytes of the second parent. However the first parent is still there, so any operation that only needs the first parent (git bisect --first-parent?) would still be fast. Not sure if that is common. The downside of just having one parent or pointer into the edge list would be to penalize 25% of the commit lookups with an indirection compared to ~0% (the 35 octopus'). I'd rather want to optimize for speed than disk size? (4 bytes for 37k is 145kB for git.git, which I find is not a lot).