Derrick Stolee <stolee@xxxxxxxxx> writes: > jt/binsearch-with-fanout introduces one when there is a 256-entry > fanout table (not the case here). > > The bsearch() method in search.h (and used in > pack-write.c:need_large_offset) does not return the _position_ of a > found element. > > Neither of these suit my needs, but I could just be searching for the > wrong strings. Also, I could divert my energies in this area to create > a generic search in the style of jt/binsearch-with-fanout. ... me goes and digs ... What I had in mind was the one in sha1-lookup.c, actually. Having said that, hand-rolling another one is not too huge a deal; eventually people will notice and consolidate code after the series stabilizes anyway ;-) >>> + num_large_edges++; >>> + parent = parent->next; >>> + } while (parent); >> It feels somewhat wasteful to traverse the commit's parents list >> only to count, without populating the octopus table (which I >> understand is assumed to be minority case under this design). > > Since we are writing the commit graph file in-order, we cannot write > the octopus table until after the chunk lengths are known. Oh, my "minority case" comment was me wondering "since we expect there are only a few, why not keep them in memory while we discover them here, so that the writing phase that come later do not have to go through all commits again counting their parents? would it be more performant and a better trade-off?" We can certainly do such an optimization later (iow, it is not all that crucial issue and certainly I didn't mention the above as something that needs to be "fixed"--there is nothing broken). > store the octopus table in memory and then dump it into the file > later, but walking the parents is quite fast after all the commits are > loaded. I'm not sure the time optimization merits the extra complexity > here. (I'm happy to revisit this if we do see this performance > lacking.) > > P.S. I really like the name "octopus table" and will use that for > informal discussions of this format. I actually do not mind that name used as the official term. I find it far more descriptive and understandable than "long edge" / "large edge" at least to a Git person. > You're right that int_id isn't great, and your more-specific > "oid_table_pos" shows an extra reason why it isn't great: when we add > the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value > is NOT a table position. Perhaps I am somewhat biased, but it is quite natural for our codebase and internal API to say something like this: x_pos(table, key) function's return value is the non-negative position for the key in the table when the key is there; when it returns a negative value, it is (-1 - position) where the "position" is the position in the table they key would have been found if it was in there. and store the return value from such a function in a variable called "pos". Surely, sometimes "pos" does not have _the_ position, but that does not make it a bad name. Saying "MISSING is a special value that denotes 'nothing is here'" and allowing it to be set to a variable that meant to hold the position is not such a bad thing, though. After all, that is how you use NULL as a special value for a pointer variable ;-). Same for using the high bit to mean something else. Taking these together you would explain "low 31-bit of pos holds the position for the item in the table. MISSING is a special value that you can use to denote there is nothing. The LAST_EDGE bit denotes that one group of positions ends there", or something like that. > I think the current name makes the following call very clear: It is still a strange name nevertheless. >>> +char *write_commit_graph(const char *obj_dir) >>> +{ >>> + struct packed_oid_list oids; >>> + struct packed_commit_list commits; >>> + struct sha1file *f; >>> + int i, count_distinct = 0; >>> + DIR *info_dir; >>> + struct strbuf tmp_file = STRBUF_INIT; >>> + struct strbuf graph_file = STRBUF_INIT; >>> + unsigned char final_hash[GIT_MAX_RAWSZ]; >>> + char *graph_name; >>> + int fd; >>> + uint32_t chunk_ids[5]; >>> + uint64_t chunk_offsets[5]; >>> + int num_chunks; >>> + int num_long_edges; >>> + struct commit_list *parent; >>> + >>> + oids.nr = 0; >>> + oids.alloc = (int)(0.15 * approximate_object_count()); >> Heh, traditionalist would probably avoid unnecessary use of float >> and use something like 1/4 or 1/8 ;-) After all, it is merely a >> ballpark guestimate. >> >>> + num_long_edges = 0; >> This again is about naming, but I find it a bit unnatural to call >> the edge between a chind and its octopus parents "long". Individual >> edges are not long--the only thing that is long is your "list of >> edges". Some other codepaths in this patch seems to call the same >> concept with s/long/large/, which I found somewhat puzzling. > > How about "num_extra_edges"... Yes, "extra" in the name makes it very understandable.