On Mon, Jun 20, 2022 at 12:56:27PM -0400, Derrick Stolee wrote: > On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote: > > From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx> > > > > When reading bitmap file, git loads each and every bitmap one by one > > even if all the bitmaps are not required. A "bitmap lookup table" > > extension to the bitmap format can reduce the overhead of loading > > bitmaps which stores a list of bitmapped commit oids, along with their > > offset and xor offset. This way git can load only the neccesary bitmaps > > without loading the previous bitmaps. > > > > Add some information for the new "bitmap lookup table" extension in the > > bitmap-format documentation. > > > > @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required. > > pack/MIDX. The format and meaning of the name-hash is > > described below. > > > > + ** {empty} > > + BITMAP_OPT_LOOKUP_TABLE (0xf) : ::: > > + If present, the end of the bitmap file contains a table > > + containing a list of `N` object ids, a list of pairs of > > + offset and xor offset of respective objects, and 4-byte > > + integer denoting the flags (currently none). The format > > + and meaning of the table is described below. > > + > > Here, you are adding a new flag that indicates that the end of the file > contains this extra extension. This works because the size of the > extension is predictable. As long as any future extensions are also of > a predictable size, then we can continue adding them via flags in this > way. Right; any extensions that are added to the existing .bitmap format must have a size that is predictable in order for readers to locate the next extension, if any. > This is better than updating the full file format to do something like > like use the chunk format API, especially because this format is shared > across other tools (JGit being mentioned frequently). Agreed. Abhradeep and I discussed whether or not it was worth exploring a new .bitmap format, and the consensus we reached was that it may be required in the future (if we explored a compression scheme other than EWAH or made some other backwards-incompatible change), but as of yet it isn't necessary. So we avoided it to eliminate unnecessary churn, especially of on-disk formats. > It might be worth mentioning in your commit message what happens when an > older version of Git (or JGit) notices this flag. Does it refuse to > operate on the .bitmap file? Does it give a warning or die? It would be > nice if this extension could be ignored (it seems like adding the extra > data at the end does not stop the bitmap data from being understood). I agree. The bitmap reader does not warn or die when it sees unrecognized extensions, that way new extensions can be added without rendering all previously-written bitmaps useless. But in order to understand an extension on bit N, the reader must also understand extensions N-1, N-2, and so on (in order to locate the end of extension N). > > + - `nr_entries` pairs of 4-byte integers, each in network order. > > + The first holds the offset from which that commit's bitmap can > > + be read. The second number holds the position of the commit > > + whose bitmap the current bitmap is xor'd with in lexicographic > > + order, or 0xffffffff if the current commit is not xor'd with > > + anything. > > Interesting to give the xor chains directions here. You say "position" > here for the second commit: do you mean within the list of object names > as opposed to the offset? That would make the most sense so we can trace > the full list of XORs we need to make all at once. > > Are .bitmap files already constrained to 4GB, so these 32-bit offsets > make sense? Using 64-bit offsets would be a small cost here, I think, > without needing to do any fancy "overflow" tables that could introduce > a variable-length extension. Yeah, we should support >4GB bitmaps here. An overflow table could work, but I agree with Stolee that in practice it won't matter. Most .bitmap files that I've looked at in the wild have around ~500 entries at most, and are usually small. So the cost of widening this section isn't a big deal. But note that the entry count is only one component of the bitmap size: the individual entry lengths obviously matter too. And in repositories whose bitmaps exceed 500 entries, the entries themselves are often several million bits long (before compression) already. So it is certainly possible to exceed 4GB without having an astronomical entry count. So doubling the width of this extension might add an extra 250 KiB or so, which is negligible. I would much rather see us do that in cases where it makes sense (small number of entries, minimal cost to wider records, etc.) than adding unnecessary complexity via an extra lookup table for >4GB offsets. > > + - One 4-byte network byte order integer specifying > > + table-specific flags. None exist currently, so this is always > > + "0". > > I'm guessing this is at the end of the extension because a future flag > could modify the length of the extension, so we need the flags to be > in a predictable location. Could we make that clear somewhere? I can't remember what I had on my mind when I wrote this ;-). Abhradeep -- do you have any thoughts about what this might be used for? I'll try to remember it myself, but I imagine that we could just as easily remove this altogether and avoid the confusion. > How does Git react to seeing flags here that it does not recognize? > It seems that Git should ignore the lookup table but continue using the > rest of the .bitmap file as it did before, yes? (See above). Thanks, Taylor