On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
>
> Bitmap lookup table extension can let git to parse only the necessary
> bitmaps without loading the previous bitmaps one by one.

Here is an attempt to reword this a bit:

  The bitmap lookup table extension was documented by an earlier change,
  but Git does not yet know how to parse that information. The extension
  allows parsing a smaller portion of the bitmap file in order to find
  bitmaps for specific commits.

> Teach git to read and use the bitmap lookup table extension.

Normally, I don't mind doing the read portion after the write portion, but it would be nice to have them in the opposite order so we can test writing the extension _and Git ignoring the extension_ before implementing the parsing. As it stands, most of the code in this patch is untested until patch 5.

General outline attempt:

1. Document the format.
2. Write the extension if the flag is given.
3. Add pack.writeBitmapLookupTable and add tests that write the lookup
   table (and do other bitmap reads on that data).
4. Read the lookup table. The tests from step 3 already cover acting
   upon the lookup table. (Perhaps we add a mode here that disables
   GIT_READ_COMMIT_TABLE since that is not used anywhere else.)
5. Performance tests.

> +	if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +	    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {

This environment variable does not appear to be used or documented anywhere. Do we really want to use it as a way to disable reading the lookup table in general? Or would it be better to have a GIT_TEST_* variable for disabling the read during testing?

> +		uint32_t entry_count = ntohl(header->entry_count);
> +		uint32_t table_size =
> +			(entry_count * the_hash_algo->rawsz) /* oids */ +
> +			(entry_count * sizeof(uint32_t)) /* offsets */ +
> +			(entry_count * sizeof(uint32_t)) /* xor offsets */ +
> +			(sizeof(uint32_t)) /* flags */;

Here, uint32_t is probably fine, but maybe we should just use size_t instead? Should we use st_mult() and st_add() everywhere? (I sketched what I mean a little further down.)

Note: you're using the_hash_algo->rawsz here, which makes sense because the bitmap format doesn't specify which hash algorithm is used. Just making this note to say that we should include the hash algorithm as a value in the bitmap format when we increment the format version (in the future).

> +		if (table_size > index_end - index->map - header_size)
> +			return error("corrupted bitmap index file (too short to fit commit table)");
> +
> +		index->table_lookup = (void *)(index_end - table_size);
> +		index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;

st_mult(), st_add()? Or, should we assume safety now?

> +		index_end -= table_size;
> +	}

> -	if (load_bitmap_entries_v1(bitmap_git) < 0)
> +	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
>  		goto failed;

Ok, don't load these entries pre-emptively if we have the lookup table.

> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						       struct commit *commit,
> +						       uint32_t *pos_hint);

I see that we have a two-method recursion loop. Please move this declaration to immediately before lazy_bitmap_for_commit() so it is declared as late as possible.
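Back to the table_size computation: here is roughly what I had in mind with size_t plus the overflow-checking helpers from git-compat-util.h. This is only a sketch and is untested:

	size_t table_size =
		st_add4(st_mult(entry_count, the_hash_algo->rawsz), /* oids */
			st_mult(entry_count, sizeof(uint32_t)),     /* offsets */
			st_mult(entry_count, sizeof(uint32_t)),     /* xor offsets */
			sizeof(uint32_t));                          /* flags */

(Nested st_add()/st_mult() calls would work just as well, and the same treatment could apply to the pointer arithmetic that computes table_offsets.)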
> +static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
> +						   uint32_t pos)
> +{
> +	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
> +}

I would call this "bitmap_hash_pos()" because we are getting a raw hash and not a 'struct object_id'. Do you want a helper that fills a 'struct object_id', perhaps passed-by-reference?

> +static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
> +					    uint32_t pos)

> +static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
> +					   uint32_t pos)

These two helpers should probably return a size_t and uint32_t instead of a pointer. Let these do get_be[32|64]() on the computed pointer.

> +static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
> +			       struct object_id *oid,
> +			       uint32_t *commit_pos)
> +{
> +	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
> +				       bitmap_git->entry_count,
> +				       the_hash_algo->rawsz, bitmap_lookup_cmp);
> +	if (found)
> +		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;
> +	return !!found;

Ok, we are running binary search and converting the pointer into a position. Frequently, these kinds of searches return an int, but use a negative value to indicate that the value was not found. Using an int in this way would restrict us to 2^31 bitmaps instead of 2^32, so maybe it is not worth matching that practice.

> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						    struct object_id *oid,
> +						    uint32_t commit_pos)
> +{
> +	uint32_t xor_pos;
> +	off_t bitmap_ofs;
> +
> +	int flags;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap;
> +
> +	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
> +	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));

These lines become simpler with a change in the helper methods' prototypes, as I recommended higher up.

> +	/*
> +	 * Lazily load the xor'd bitmap if required (and we haven't done so
> +	 * already). Make sure to pass the xor'd bitmap's position along as a
> +	 * hint to avoid an unnecessary binary search in
> +	 * stored_bitmap_for_commit().
> +	 */
> +	if (xor_pos == 0xffffffff) {
> +		xor_bitmap = NULL;
> +	} else {
> +		struct commit *xor_commit;
> +		struct object_id xor_oid;
> +
> +		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));
> +
> +		xor_commit = lookup_commit(the_repository, &xor_oid);
> +		if (!xor_commit)
> +			return NULL;
> +
> +		xor_bitmap = stored_bitmap_for_commit(bitmap_git, xor_commit,
> +						      &xor_pos);
> +	}

This is using an interesting type of tail-recursion. We might be better off using a loop with a stack: push to the stack the commit positions of the XOR bitmaps. At the very bottom, we get a bitmap without an XOR base. Then, pop off the stack, modifying the bitmap with XOR operations as we go. (Perhaps we also store these bitmaps in-memory along the way?) Finally, we have the necessary bitmap. This iterative approach avoids possible stack exhaustion if there are long XOR chains in the file. (A rough sketch of such a loop is at the end of this message.)

> +
> +	/*
> +	 * Don't bother reading the commit's index position or its xor
> +	 * offset:
> +	 *
> +	 *   - The commit's index position is irrelevant to us, since
> +	 *     load_bitmap_entries_v1 only uses it to learn the object
> +	 *     id which is used to compute the hashmap's key. We already
> +	 *     have an object id, so no need to look it up again.
> +	 *
> +	 *   - The xor_offset is unusable for us, since it specifies how
> +	 *     many entries previous to ours we should look at. This
> +	 *     makes sense when reading the bitmaps sequentially (as in
> +	 *     load_bitmap_entries_v1()), since we can keep track of
> +	 *     each bitmap as we read them.
> +	 *
> +	 *     But it can't work for us, since the bitmap's don't have a
> +	 *     fixed size. So we learn the position of the xor'd bitmap
> +	 *     from the commit table (and resolve it to a bitmap in the
> +	 *     above if-statement).
> +	 *
> +	 * Instead, we can skip ahead and immediately read the flags and
> +	 * ewah bitmap.
> +	 */
> +	bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
> +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +	bitmap = read_bitmap_1(bitmap_git);
> +	if (!bitmap)
> +		return NULL;
> +
> +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);

Looks like we'd want to call store_bitmap() while popping the stack in the loop I recommended above.

> +}
> +
> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		uint32_t commit_pos;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;
> +
> +		/* NEEDSWORK: cache misses aren't recorded. */
> +		if (pos_hint)
> +			commit_pos = *pos_hint;
> +		else if (!bitmap_table_lookup(bitmap_git,
> +					      &commit->object.oid,
> +					      &commit_pos))
> +			return NULL;
> +		return lazy_bitmap_for_commit(bitmap_git, &commit->object.oid,
> +					      commit_pos);

The extra bonus of going incremental is that we don't have recursion across two methods, which I always find difficult to reason about.

> +	}
> +	return kh_value(bitmap_git->bitmaps, hash_pos);
> +}

> @@ -26,6 +26,7 @@ struct bitmap_disk_header {
>  enum pack_bitmap_opts {
>  	BITMAP_OPT_FULL_DAG = 1,
>  	BITMAP_OPT_HASH_CACHE = 4,
> +	BITMAP_OPT_LOOKUP_TABLE = 16,

Perhaps it is time to use hexadecimal representation here to match the file format document?

Thanks,
-Stolee
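P.S. Here is the rough sketch of the iterative xor resolution I mentioned above. It is completely untested and only meant to show the shape of the loop: it reuses the helpers quoted earlier, relies on store_bitmap() returning the new 'struct stored_bitmap *' (as this patch already does), and drops the 'oid' parameter since every hash can be re-read from the lookup table:

static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
						    uint32_t commit_pos)
{
	struct stored_bitmap *stored = NULL;
	uint32_t *stack = NULL;
	size_t stack_nr = 0, stack_alloc = 0;
	uint32_t pos = commit_pos;

	/*
	 * Phase 1: walk the xor chain, remembering every position that
	 * still needs to be parsed. Stop early if we reach a bitmap that
	 * is already in the hashmap, or one without an xor base.
	 */
	while (pos != 0xffffffff) {
		struct object_id cur_oid;
		khiter_t hash_pos;

		oidread(&cur_oid, bitmap_oid_pos(bitmap_git, pos));
		hash_pos = kh_get_oid_map(bitmap_git->bitmaps, cur_oid);
		if (hash_pos < kh_end(bitmap_git->bitmaps)) {
			stored = kh_value(bitmap_git->bitmaps, hash_pos);
			break;
		}

		ALLOC_GROW(stack, stack_nr + 1, stack_alloc);
		stack[stack_nr++] = pos;

		pos = get_be32(xor_position_pos(bitmap_git, pos));
	}

	/*
	 * Phase 2: pop the chain back off, parsing and storing each
	 * bitmap as we go. 'stored' is always the xor base of the entry
	 * being popped (or NULL at the bottom of the chain).
	 */
	while (stack_nr) {
		struct object_id cur_oid;
		struct ewah_bitmap *bitmap;
		size_t bitmap_ofs;
		int flags;

		pos = stack[--stack_nr];
		oidread(&cur_oid, bitmap_oid_pos(bitmap_git, pos));
		bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, pos));

		/* skip the commit index position and the xor offset */
		bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
		flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
		bitmap = read_bitmap_1(bitmap_git);
		if (!bitmap) {
			free(stack);
			return NULL;
		}

		stored = store_bitmap(bitmap_git, bitmap, &cur_oid, stored, flags);
	}

	free(stack);
	return stored;
}

Whether we store every intermediate bitmap (as the store_bitmap() call in the loop does) or only the outermost one is a memory/CPU trade-off; storing them along the way matches what load_bitmap_entries_v1() effectively gives us today.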