Re: [GSOC] Blog about week 6

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:

> Hello everyone!
>
> Over the last week, I started implementing generation number v2. Based
> on the performance numbers and subsequent discussion, we have decided to
> use generation data chunk approach, storing topological levels into CDAT
> and corrected commit date offsets into GDAT chunk. Do note that we are
> maintaining backward compatibility using topological level, not with
> monotonic offsets.
>
> I also found a performance regression with `commit-graph write
> --reachable --changed_path` that sneaked in along with the patch on
> moving generation and graph positions into commit-slab.
>
> https://github.com/gitgitgadget/git/pull/676
>
> Yes, I pushed without signing off or a cover letter. It's still a work
> in progress.

All right.  Did you send this series of patches to git mailing list for
a review already, or is it only a working version?

>
> https://abhishekkumar2718.github.io/gsoc/2020/07/12/gsoc-week-6.html

Reply to the blog contents follows:

----------------------------------------------------------------------

> GSoC - Weeks 6
> ==============

Very minor nitpick / correction - I think you meant "Week" not "Weeks"
here in the title of this blog.

> Over the last week, I worked on implementing corrected commit date
> with generation data chunk and boy, was it painful!
>
> While working on the prototype, I used the space allocated for graph
> position to store topological level before writing it into CDAT chunk.

First, I assume that if we choose to not compute topological levels and
simply store GENERATION_NUMBER_MAX in the 30-bits in CDAT set for
generation number v1 we would not have this problem, isn't it?

Second, isn't `graph_pos` nowadays on the commit slab? Why reuse it,
which can lead to problems if other parts of code assume things about
its syntax, rather than making use of _new_ uint32_t `topo_level` commit
slab?  We have many different ad-hoc entries stored on the slab, what's
one more...

> However, while computing bloom filters, we call `get_bloom_filter()`.

Ah, I guess this happens when `git commit-graph write` is called with
the `--changed-paths` option, so we compute both generation numbers v1
and v2, and Bloom filters for changed paths.

> This function checks if the commit belongs to the commit-graph. If it
> does, load the bloom filters (avoiding a lot of computation).
> Otherwise, compute the filters and return it. Since I am storing
> topological levels in graph position, Git mistakenly thinks that we
> are reading a graph and tries to load bloom filter, segfaulting right
> away.

So don't do that; don't store topological level in graph position.

> Faced with the problem, I could think of three ideas:
>

> 1. Extend generation to be 64-bits (which we were planning on anyway)
>    and store topological levels within higher 32 bits and corrected date
>    offsets within lower 32 bits. Code for calculating corrected date
>    offsets, topological levels would be littered with bit shifts and
>    ANDs, but code for parsing and using generation numbers would be
>    clean.

If I remember it correctly, extending `generation` to 64-bits in
`struct commit_graph_data` (stored on slab) is there to store and
operate on the corrected commit(ter) date, instead of storing and
fiddling with offsets?

>
> 2. Use two 32-bits variables, level and odate, and use the contiguous
>    64 bits to store generation (while reading commit-graph): Code for
>    writing commit-graph is clean, but reading generation is much trickier
>    as we try to coerce value stored in (graph_data + 4) into a timestamp.

I don't quite get this idea.  For backward compatibility we have to
store topological levels in appropriate 30-bit wide field in CDAT chunk.
Storing topological levels in higher 32-bits of 64-bit GDAT chunk would
not provide it.

> 3. Split get_bloom_filters() into two functions - First checks if the
>    commit is from the graph and tries to load whereas the other computes
>    bloom filter. Then we could directly call the second function when
>    writing a commit-graph. I was not sure if this would have worked, but
>    I wanted not to change things unless required.

4. Store topological level data in a separate commit slab entry.
   This way any code that examines `commit_graph_data.position`,
   like `get_bloom_filter()`, wouldn't be mislead.

About the 4th possibility: does Git use both graph position and
generation number in the same process?  Because if it is not, then
perhaps putting graph_position and graph_generation_number in a
composite type `struct commit_graph_data` on slab might have been a
mistake.  Due to how modern CPUs work (with data prefetching and
pipelining) it might be more advantageous wrt performance to have
subsequent data for graph position or for generation number to be packed
and not strides, as it is now -- at least when accessing commits in the
commit.index order.

> I felt the first two approaches were too unreadable.
>

I think the 3rd approach is worth doing on its own, as it makes the code
bit clearer... but I have not actually examined the commits in the pull
request.

> In the end, I compromised by using a 64-bit generation and a 32-bit
> level in the initial commit and will restrict the ugly conversion to
> reuse 64-bits in a focused patch later in the series.

I would have to examine the patch in question in more detail, but I am
waiting for the series to be posted to Git mailing list (though I can be
persuaded to do pre-review on GitHub pull request, as pre-release step).

> Regression when computing bloom filters
> =======================================
>
> While wrestling with the problem above, I noticed that
> `commit_gen_cmp()` lies in “commit-graph write” path but uses the
> helpers. But the helper would always return `GENERATION_NUMBER_MAX`
> as we found while moving generation number into a slab.
>
> Digging through git logs, I found [the patch][1], which introduced
> `commit_gen_cmp()`. Sorting commits make the computing bloom filters
> much, much faster. The helper returns the same value every time, which
> makes sorting pointless and nullifies the performance boost. Fixing
> this was easy, bypass the helper and access generation number.
>
> [1]: https://github.com/git/git/commit/3d11275505694ce4e5256516de1c5dd90e749303

Good catch!

Best,
--
Jakub Narębski




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux