Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 23 Feb 2018 11:30:07 -0800

Derrick Stolee <stolee@xxxxxxxxx> writes:

> jt/binsearch-with-fanout introduces one when there is a 256-entry
> fanout table (not the case here).
>
> The bsearch() method in search.h (and used in
> pack-write.c:need_large_offset) does not return the _position_ of a
> found element.
>
> Neither of these suit my needs, but I could just be searching for the
> wrong strings. Also, I could divert my energies in this area to create
> a generic search in the style of jt/binsearch-with-fanout.

... me goes and digs ...

What I had in mind was the one in sha1-lookup.c, actually.  Having
said that, hand-rolling another one is not too huge a deal;
eventually people will notice and consolidate code after the series
stabilizes anyway ;-)

>>> +				num_large_edges++;
>>> +				parent = parent->next;
>>> +			} while (parent);
>> It feels somewhat wasteful to traverse the commit's parents list
>> only to count, without populating the octopus table (which I
>> understand is assumed to be minority case under this design).
>
> Since we are writing the commit graph file in-order, we cannot write
> the octopus table until after the chunk lengths are known.

Oh, my "minority case" comment was me wondering "since we expect
there are only a few, why not keep them in memory while we discover
them here, so that the writing phase that come later do not have to
go through all commits again counting their parents?  would it be
more performant and a better trade-off?"  We can certainly do such
an optimization later (iow, it is not all that crucial issue and
certainly I didn't mention the above as something that needs to be
"fixed"--there is nothing broken).

> store the octopus table in memory and then dump it into the file
> later, but walking the parents is quite fast after all the commits are
> loaded. I'm not sure the time optimization merits the extra complexity
> here. (I'm happy to revisit this if we do see this performance
> lacking.)
>
> P.S. I really like the name "octopus table" and will use that for
> informal discussions of this format.

I actually do not mind that name used as the official term.  I find
it far more descriptive and understandable than "long edge" / "large
edge" at least to a Git person.

> You're right that int_id isn't great, and your more-specific
> "oid_table_pos" shows an extra reason why it isn't great: when we add
> the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value
> is NOT a table position.

Perhaps I am somewhat biased, but it is quite natural for our
codebase and internal API to say something like this:

    x_pos(table, key) function's return value is the non-negative
    position for the key in the table when the key is there; when it
    returns a negative value, it is (-1 - position) where the "position"
    is the position in the table they key would have been found if
    it was in there.

and store the return value from such a function in a variable called
"pos".  Surely, sometimes "pos" does not have _the_ position, but
that does not make it a bad name.

Saying "MISSING is a special value that denotes 'nothing is here'"
and allowing it to be set to a variable that meant to hold the
position is not such a bad thing, though.  After all, that is how
you use NULL as a special value for a pointer variable ;-).

Same for using the high bit to mean something else.  Taking these
together you would explain "low 31-bit of pos holds the position for
the item in the table.  MISSING is a special value that you can use
to denote there is nothing.  The LAST_EDGE bit denotes that one
group of positions ends there", or something like that.

> I think the current name makes the following call very clear:

It is still a strange name nevertheless.

>>> +char *write_commit_graph(const char *obj_dir)
>>> +{
>>> +	struct packed_oid_list oids;
>>> +	struct packed_commit_list commits;
>>> +	struct sha1file *f;
>>> +	int i, count_distinct = 0;
>>> +	DIR *info_dir;
>>> +	struct strbuf tmp_file = STRBUF_INIT;
>>> +	struct strbuf graph_file = STRBUF_INIT;
>>> +	unsigned char final_hash[GIT_MAX_RAWSZ];
>>> +	char *graph_name;
>>> +	int fd;
>>> +	uint32_t chunk_ids[5];
>>> +	uint64_t chunk_offsets[5];
>>> +	int num_chunks;
>>> +	int num_long_edges;
>>> +	struct commit_list *parent;
>>> +
>>> +	oids.nr = 0;
>>> +	oids.alloc = (int)(0.15 * approximate_object_count());
>> Heh, traditionalist would probably avoid unnecessary use of float
>> and use something like 1/4 or 1/8 ;-)  After all, it is merely a
>> ballpark guestimate.
>>
>>> +	num_long_edges = 0;
>> This again is about naming, but I find it a bit unnatural to call
>> the edge between a chind and its octopus parents "long".  Individual
>> edges are not long--the only thing that is long is your "list of
>> edges".  Some other codepaths in this patch seems to call the same
>> concept with s/long/large/, which I found somewhat puzzling.
>
> How about "num_extra_edges"...

Yes, "extra" in the name makes it very understandable.