Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jun 20, 2022 at 12:56:27PM -0400, Derrick Stolee wrote:
> On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> > From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
> >
> > When reading bitmap file, git loads each and every bitmap one by one
> > even if all the bitmaps are not required. A "bitmap lookup table"
> > extension to the bitmap format can reduce the overhead of loading
> > bitmaps which stores a list of bitmapped commit oids, along with their
> > offset and xor offset. This way git can load only the neccesary bitmaps
> > without loading the previous bitmaps.
> >
> > Add some information for the new "bitmap lookup table" extension in the
> > bitmap-format documentation.
>
>
> > @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
> >  			pack/MIDX. The format and meaning of the name-hash is
> >  			described below.
> >
> > +			** {empty}
> > +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
> > +			If present, the end of the bitmap file contains a table
> > +			containing a list of `N` object ids, a list of pairs of
> > +			offset and xor offset of respective objects, and 4-byte
> > +			integer denoting the flags (currently none). The format
> > +			and meaning of the table is described below.
> > +
>
> Here, you are adding a new flag that indicates that the end of the file
> contains this extra extension. This works because the size of the
> extension is predictable. As long as any future extensions are also of
> a predictable size, then we can continue adding them via flags in this
> way.

Right; any extensions that are added to the existing .bitmap format must
have a size that is predictable in order for readers to locate the next
extension, if any.

> This is better than updating the full file format to do something like
> like use the chunk format API, especially because this format is shared
> across other tools (JGit being mentioned frequently).

Agreed. Abhradeep and I discussed whether or not it was worth exploring
a new .bitmap format, and the consensus we reached was that it may be
required in the future (if we explored a compression scheme other than
EWAH or made some other backwards-incompatible change), but as of yet it
isn't necessary. So we avoided it to eliminate unnecessary churn,
especially of on-disk formats.

> It might be worth mentioning in your commit message what happens when an
> older version of Git (or JGit) notices this flag. Does it refuse to
> operate on the .bitmap file? Does it give a warning or die? It would be
> nice if this extension could be ignored (it seems like adding the extra
> data at the end does not stop the bitmap data from being understood).

I agree. The bitmap reader does not warn or die when it sees
unrecognized extensions, that way new extensions can be added without
rendering all previously-written bitmaps useless. But in order to
understand an extension on bit N, the reader must also understand
extensions N-1, N-2, and so on (in order to locate the end of
extension N).

> > +	- `nr_entries` pairs of 4-byte integers, each in network order.
> > +	  The first holds the offset from which that commit's bitmap can
> > +	  be read. The second number holds the position of the commit
> > +	  whose bitmap the current bitmap is xor'd with in lexicographic
> > +	  order, or 0xffffffff if the current commit is not xor'd with
> > +	  anything.
>
> Interesting to give the xor chains directions here. You say "position"
> here for the second commit: do you mean within the list of object names
> as opposed to the offset? That would make the most sense so we can trace
> the full list of XORs we need to make all at once.
>
> Are .bitmap files already constrained to 4GB, so these 32-bit offsets
> make sense? Using 64-bit offsets would be a small cost here, I think,
> without needing to do any fancy "overflow" tables that could introduce
> a variable-length extension.

Yeah, we should support >4GB bitmaps here. An overflow table could work,
but I agree with Stolee that in practice it won't matter. Most .bitmap
files that I've looked at in the wild have around ~500 entries at most,
and are usually small. So the cost of widening this section isn't a big
deal.

But note that the entry count is only one component of the bitmap size:
the individual entry lengths obviously matter too. And in repositories
whose bitmaps exceed 500 entries, the entries themselves are often
several million bits long (before compression) already. So it is
certainly possible to exceed 4GB without having an astronomical entry
count.

So doubling the width of this extension might add an extra 250 KiB or
so, which is negligible.

I would much rather see us do that in cases where it makes sense (small
number of entries, minimal cost to wider records, etc.) than adding
unnecessary complexity via an extra lookup table for >4GB offsets.

> > +	- One 4-byte network byte order integer specifying
> > +	  table-specific flags. None exist currently, so this is always
> > +	  "0".
>
> I'm guessing this is at the end of the extension because a future flag
> could modify the length of the extension, so we need the flags to be
> in a predictable location. Could we make that clear somewhere?

I can't remember what I had on my mind when I wrote this ;-).

Abhradeep -- do you have any thoughts about what this might be used for?
I'll try to remember it myself, but I imagine that we could just as
easily remove this altogether and avoid the confusion.

> How does Git react to seeing flags here that it does not recognize?
> It seems that Git should ignore the lookup table but continue using the
> rest of the .bitmap file as it did before, yes?

(See above).

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux