On Mon, Jun 20, 2022 at 12:33:10PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:

> From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
>
> Bitmap lookup table extension can let git to parse only the necessary
> bitmaps without loading the previous bitmaps one by one.
>
> Teach git to read and use the bitmap lookup table extension.
>
> Co-Authored-by: Taylor Blau <ttaylorr@xxxxxxxxxx>
> Mentored-by: Taylor Blau <ttaylorr@xxxxxxxxxx>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@xxxxxxxxx>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
> ---
>  pack-bitmap.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  pack-bitmap.h |   1 +
>  2 files changed, 166 insertions(+), 7 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 36134222d7a..d5e5973a79f 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -15,6 +15,7 @@
>  #include "list-objects-filter-options.h"
>  #include "midx.h"
>  #include "config.h"
> +#include "hash-lookup.h"
>
>  /*
>   * An entry on the bitmap index, representing the bitmap for a given
> @@ -82,6 +83,13 @@ struct bitmap_index {
>  	/* The checksum of the packfile or MIDX; points into map. */
>  	const unsigned char *checksum;
>
> +	/*
> +	 * If not NULL, these point into the various commit table sections
> +	 * (within map).
> +	 */
> +	unsigned char *table_lookup;
> +	unsigned char *table_offsets;
> +

If table_offsets ends up being a list of just offsets, we could assign
this to the appropriate type, e.g., 'uint64_t *'. We would want to avoid
using a type whose width is platform dependent, like off_t.

But if you end up taking my suggestion from a previous response (of
making each entry in the offset table a triple of commit, offset, and
xor position), make sure to _not_ get tempted to define a struct and
assign table_lookup to be a pointer of that structure type.
That's because even though the struct *should* be packed as you expect,
the packing is mostly up to the compiler, so you can't guarantee its
members won't have padding between them or at the end of the struct for
alignment purposes.

>  	/*
>  	 * Extended index.
>  	 *
> @@ -185,6 +193,24 @@ static int load_bitmap_header(struct bitmap_index *index)
>  		index->hashes = (void *)(index_end - cache_size);
>  		index_end -= cache_size;
>  	}
> +
> +	if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +	    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {

What is the purpose of the GIT_READ_COMMIT_TABLE environment variable? I
assume that it's to make it easier to run tests (especially performance
ones) with and without access to the lookup table.

If so, we should document that (lightly) in the commit message, and
rename this to be GIT_TEST_READ_COMMIT_TABLE to indicate that it
shouldn't be used outside of tests.

> +		uint32_t entry_count = ntohl(header->entry_count);
> +		uint32_t table_size =
> +			(entry_count * the_hash_algo->rawsz) /* oids */ +
> +			(entry_count * sizeof(uint32_t)) /* offsets */ +
> +			(entry_count * sizeof(uint32_t)) /* xor offsets */ +
> +			(sizeof(uint32_t)) /* flags */;

entry_count is definitely a 4-byte integer, so uint32_t is the right
type. But I think table_size should be a size_t, and computations on it
should be more strictly checked.
Perhaps something like:

    size_t table_size = sizeof(uint32_t); /* flags */
    table_size = st_add(table_size,
			st_mult(entry_count, the_hash_algo->rawsz)); /* oids */
    table_size = st_add(table_size,
			st_mult(entry_count, sizeof(uint32_t))); /* offsets */
    table_size = st_add(table_size,
			st_mult(entry_count, sizeof(uint32_t))); /* xor offsets */

or even:

    size_t table_size = sizeof(uint32_t); /* flags */
    table_size = st_add(table_size,
			st_mult(entry_count,
				the_hash_algo->rawsz + /* oids */
				sizeof(uint32_t) + /* offsets */
				sizeof(uint32_t) /* xor offsets */
				));

> +		if (table_size > index_end - index->map - header_size)
> +			return error("corrupted bitmap index file (too short to fit commit table)");
> +
> +		index->table_lookup = (void *)(index_end - table_size);
> +		index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;
> +
> +		index_end -= table_size;
> +	}

Looks good.

> @@ -470,7 +496,7 @@ static int load_bitmap(struct bitmap_index *bitmap_git)
>  	    !(bitmap_git->tags = read_bitmap_1(bitmap_git)))
>  		goto failed;
>
> -	if (load_bitmap_entries_v1(bitmap_git) < 0)
> +	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
>  		goto failed;

No need to load each of the bitmaps individually via
load_bitmap_entries_v1() if we have a lookup table. That function
doesn't do any other initialization that we depend on, so it's OK to
just avoid calling it altogether.
> 	return 0;
> @@ -557,14 +583,145 @@ struct include_data {
>  	struct bitmap *seen;
>  };
>
> -struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
> -				      struct commit *commit)
> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint);
> +
> +static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
> +						  uint32_t pos)
> +{
> +	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
> +}
> +
> +static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
> +					    uint32_t pos)
> +{
> +	return bitmap_git->table_offsets + (pos * 2 * sizeof(uint32_t));
> +}
> +
> +static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
> +					   uint32_t pos)
> +{
> +	return (unsigned char*) bitmap_offset_pos(bitmap_git, pos) + sizeof(uint32_t);
> +}
> +
> +static int bitmap_lookup_cmp(const void *_va, const void *_vb)
> +{
> +	return hashcmp(_va, _vb);
> +}

All makes sense. Some light documentation might help explain what this
comparator function is used for (the bsearch() call below in
bitmap_table_lookup()), although I suspect that this function will get
slightly more complicated if you pack the table contents as I suggest,
in which case more documentation will definitely help.

> +
> +static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
> +			       struct object_id *oid,
> +			       uint32_t *commit_pos)
> +{
> +	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
> +				       bitmap_git->entry_count,
> +				       the_hash_algo->rawsz, bitmap_lookup_cmp);
> +	if (found)
> +		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;

If you end up changing the type of bitmap_git->table_lookup, make sure
that you scale the result of the pointer arithmetic accordingly, or cast
down to an 'unsigned char *' before you do any math.
> +	return !!found;
> +}
> +
> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						    struct object_id *oid,
> +						    uint32_t commit_pos)
> +{
> +	uint32_t xor_pos;
> +	off_t bitmap_ofs;
> +
> +	int flags;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap;
> +
> +	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
> +	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));
> +
> +	/*
> +	 * Lazily load the xor'd bitmap if required (and we haven't done so
> +	 * already). Make sure to pass the xor'd bitmap's position along as a
> +	 * hint to avoid an unnecessary binary search in
> +	 * stored_bitmap_for_commit().
> +	 */
> +	if (xor_pos == 0xffffffff) {
> +		xor_bitmap = NULL;
> +	} else {
> +		struct commit *xor_commit;
> +		struct object_id xor_oid;
> +
> +		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));

Interesting; this is a point that I forgot about from the original
patch. xor_pos is an index (not an offset) into the list of commits in
the table of contents, in the order they appear in that table. We should
be clear about (a) what that order is, and (b) that xor_pos is an index
into that order.

The rest of this function looks good to me.

> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		uint32_t commit_pos;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;

I was going to suggest moving this check into the caller
bitmap_for_commit() and making it a BUG() to call
stored_bitmap_for_commit() with a NULL bitmap_git->table_lookup pointer.
And I think this makes sense...
if we return NULL here, then we know that we definitely don't have a
stored bitmap, since there's no table to look it up in and we have
already loaded everything else. So we propagate that NULL to the return
value of bitmap_for_commit(), and that makes sense. Good.

> +		/* NEEDSWORK: cache misses aren't recorded. */

Yeah. The problem here is that we can't record every commit that
_doesn't_ have a bitmap every time we return NULL from one of these
queries, since there are arbitrarily many such commits that don't have
bitmaps.

We could approximate it using a Bloom filter or something, and much of
that code is already written and could be interesting to try and reuse.

But I wonder if we could get by with something simpler, which would
cause us to load all bitmaps from the lookup table after a fixed number
of cache misses (at which point we should force ourselves to load
everything and just read everything out of a single O(1) lookup in the
stored bitmap table).

That may or may not be a good idea, and the threshold will probably be
highly dependent on the system. So it may not even be worth it, but I
think it's an interesting area to experiment in and think a little more
about.

> +		if (pos_hint)
> +			commit_pos = *pos_hint;

How does this commit_pos work again? I confess I have forgotten since I
wrote some of this code a while ago... :-).

> @@ -1699,8 +1856,9 @@ void test_bitmap_walk(struct rev_info *revs)
>  	if (revs->pending.nr != 1)
>  		die("you must specify exactly one commit to test");
>
> -	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> -		bitmap_git->version, bitmap_git->entry_count);
> +	if (!bitmap_git->table_lookup)
> +		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> +			bitmap_git->version, bitmap_git->entry_count);

Should we print this regardless of whether or not there is a lookup
table? We should be able to learn the entry count either way.

Thanks,
Taylor