Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jun 20, 2022 at 12:33:09PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
>
> When reading bitmap file, git loads each and every bitmap one by one
> even if all the bitmaps are not required. A "bitmap lookup table"
> extension to the bitmap format can reduce the overhead of loading
> bitmaps which stores a list of bitmapped commit oids, along with their
> offset and xor offset. This way git can load only the neccesary bitmaps
> without loading the previous bitmaps.

Well put. It might help to have a concrete example of where we expect
this to help and not help. I suspect that some of this will show up in
your work updating the perf suite to use this new table, but I imagine
that we'll find something like:

    In cases where the result can be read or computed without
    significant additional traversal (e.g., all commits of interest
    already have bitmaps computed), we can save some time loading and
    parsing a majority of the bitmap file that we will never read.

    But in cases where the bitmaps are out-of-date, or there is
    significant traversal required to go from the reference tips to
    what's contained in the .bitmap file, this table provides minimal
    benefit (or something).

Of course, you should verify that that is actually true before we insert
it into the commit message as such ;-). But that sort of information may
help readers understand what the purpose of this change is towards the
beinning of the series.

> Add some information for the new "bitmap lookup table" extension in the
> bitmap-format documentation.
>
> Co-Authored-by: Taylor Blau <ttaylorr@xxxxxxxxxx>
> Mentored-by: Taylor Blau <ttaylorr@xxxxxxxxxx>

Here and elsewhere: I typically use my <me@xxxxxxxxxxxx> address when
contributing to Git. So any trailers that mention my email or commits
that you send on my behalf should use that address, too.

> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@xxxxxxxxx>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
> ---
>  Documentation/technical/bitmap-format.txt | 31 +++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
> index 04b3ec21785..34e98787b78 100644
> --- a/Documentation/technical/bitmap-format.txt
> +++ b/Documentation/technical/bitmap-format.txt
> @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
>  			pack/MIDX. The format and meaning of the name-hash is
>  			described below.
>
> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::

It the space between "(0xf)" and the first ":" intentional? Similarly,
should there be two or three colons at the end (either "::" or ":::")?

> +			If present, the end of the bitmap file contains a table
> +			containing a list of `N` object ids, a list of pairs of
> +			offset and xor offset of respective objects, and 4-byte
> +			integer denoting the flags (currently none). The format
> +			and meaning of the table is described below.
> +

I remember we had a brief off-list discussion about whether we should
store the full object IDs in the offset table, or whether we could store
their pack- or index-relative ordering. Is there a reason to prefer one
or the other?

I don't think we need to explain the choice fully in the documentation
in this patch, but it may be worth thinking about separately
nonetheless. We can store either order and convert it to an object ID in
constant time.

To figure out which is best, I would recommend trying a few different
choices here and seeing how they do or don't impact your performance
testing.

>  		4-byte entry count (network byte order)
>
>  			The total count of entries (bitmapped commits) in this bitmap index.
> @@ -205,3 +213,26 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
>  If implementations want to choose a different hashing scheme, they are
>  free to do so, but MUST allocate a new header flag (because comparing
>  hashes made under two different schemes would be pointless).
> +
> +Commit lookup table
> +-------------------
> +
> +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
> +contains a lookup table specifying the positions of commits which have a
> +bitmap.
> +
> +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
> +is as follows:
> +
> +	- `nr_entries` object names.
> +
> +	- `nr_entries` pairs of 4-byte integers, each in network order.
> +	  The first holds the offset from which that commit's bitmap can
> +	  be read. The second number holds the position of the commit
> +	  whose bitmap the current bitmap is xor'd with in lexicographic
> +	  order, or 0xffffffff if the current commit is not xor'd with
> +	  anything.

A couple of small thoughts here. I wonder if we'd get better locality if
we made each record look something like:

    (object_id, offset, xor_pos)

Where object_id is either 20- or 4-bytes long (depending if we store the
full object ID, or some 4-byte identifier that allows us to discover
it), offset is 8 bytes long, and xor_pos is 4-bytes (since in practice
we don't support packs or MIDXs which have more than 2^32-1 objects).

In the event that this table doesn't fit into a single cache line, I
think we'll get better performance out of reading it by not forcing the
cache to evict itself whenever we need to refer back to the object_id.

> +	- One 4-byte network byte order integer specifying
> +	  table-specific flags. None exist currently, so this is always
> +	  "0".

I mentioned in my reply to Stolee earlier, but I think that we should
either (a) try to remember what this is for and document it, or (b)
remove it.

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux