Re: [GSOC] Blog about week 6

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Jul 18, 2020 at 12:49:11AM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> 
> > Hello everyone!
> >
> > Over the last week, I started implementing generation number v2. Based
> > on the performance numbers and subsequent discussion, we have decided to
> > use generation data chunk approach, storing topological levels into CDAT
> > and corrected commit date offsets into GDAT chunk. Do note that we are
> > maintaining backward compatibility using topological level, not with
> > monotonic offsets.
> >
> > I also found a performance regression with `commit-graph write
> > --reachable --changed_path` that sneaked in along with the patch on
> > moving generation and graph positions into commit-slab.
> >
> > https://github.com/gitgitgadget/git/pull/676
> >
> > Yes, I pushed without signing off or a cover letter. It's still a work
> > in progress.
> 
> All right.  Did you send this series of patches to git mailing list for
> a review already, or is it only a working version?
> 

It's a working version so far.

> >
> > https://abhishekkumar2718.github.io/gsoc/2020/07/12/gsoc-week-6.html
> 
> Reply to the blog contents follows:
> 
> ----------------------------------------------------------------------
> 
> > GSoC - Weeks 6
> > ==============
> 
> Very minor nitpick / correction - I think you meant "Week" not "Weeks"
> here in the title of this blog.
> 

Thanks! Will fix it.

> > Over the last week, I worked on implementing corrected commit date
> > with generation data chunk and boy, was it painful!
> >
> > While working on the prototype, I used the space allocated for graph
> > position to store topological level before writing it into CDAT chunk.
> 
> First, I assume that if we choose to not compute topological levels and
> simply store GENERATION_NUMBER_MAX in the 30-bits in CDAT set for
> generation number v1 we would not have this problem, isn't it?
> 
> Second, isn't `graph_pos` nowadays on the commit slab? Why reuse it,
> which can lead to problems if other parts of code assume things about
> its syntax, rather than making use of _new_ uint32_t `topo_level` commit
> slab?  We have many different ad-hoc entries stored on the slab, what's
> one more...
> 
> > However, while computing bloom filters, we call `get_bloom_filter()`.
> 
> Ah, I guess this happens when `git commit-graph write` is called with
> the `--changed-paths` option, so we compute both generation numbers v1
> and v2, and Bloom filters for changed paths.
> 
> > This function checks if the commit belongs to the commit-graph. If it
> > does, load the bloom filters (avoiding a lot of computation).
> > Otherwise, compute the filters and return it. Since I am storing
> > topological levels in graph position, Git mistakenly thinks that we
> > are reading a graph and tries to load bloom filter, segfaulting right
> > away.
> 
> So don't do that; don't store topological level in graph position.
> 
> > Faced with the problem, I could think of three ideas:
> >
> 
> > 1. Extend generation to be 64-bits (which we were planning on anyway)
> >    and store topological levels within higher 32 bits and corrected date
> >    offsets within lower 32 bits. Code for calculating corrected date
> >    offsets, topological levels would be littered with bit shifts and
> >    ANDs, but code for parsing and using generation numbers would be
> >    clean.
> 
> If I remember it correctly, extending `generation` to 64-bits in
> `struct commit_graph_data` (stored on slab) is there to store and
> operate on the corrected commit(ter) date, instead of storing and
> fiddling with offsets?
> 

Well, yes. We could ask for a new slab for topo_levels while writing the
graph. But since we are writing offsets (to minimize space used by the
graph), I store offsets through out `compute_generation_number`. Since
the offsets fit into 32-bits, I thought of reusing the higher 32-bits to
store topological levels.

> >
> > 2. Use two 32-bits variables, level and odate, and use the contiguous
> >    64 bits to store generation (while reading commit-graph): Code for
> >    writing commit-graph is clean, but reading generation is much trickier
> >    as we try to coerce value stored in (graph_data + 4) into a timestamp.
> 
> I don't quite get this idea.  For backward compatibility we have to
> store topological levels in appropriate 30-bit wide field in CDAT chunk.
> Storing topological levels in higher 32-bits of 64-bit GDAT chunk would
> not provide it.

Here, I am talking not about the how data is stored on chunks but rather
how it's stored in the commit-slab while writing commit-graph.

> 
> > 3. Split get_bloom_filters() into two functions - First checks if the
> >    commit is from the graph and tries to load whereas the other computes
> >    bloom filter. Then we could directly call the second function when
> >    writing a commit-graph. I was not sure if this would have worked, but
> >    I wanted not to change things unless required.
> 
> 4. Store topological level data in a separate commit slab entry.
>    This way any code that examines `commit_graph_data.position`,
>    like `get_bloom_filter()`, wouldn't be mislead.
> 
> About the 4th possibility: does Git use both graph position and
> generation number in the same process?  Because if it is not, then
> perhaps putting graph_position and graph_generation_number in a
> composite type `struct commit_graph_data` on slab might have been a
> mistake.  Due to how modern CPUs work (with data prefetching and
> pipelining) it might be more advantageous wrt performance to have
> subsequent data for graph position or for generation number to be packed
> and not strides, as it is now -- at least when accessing commits in the
> commit.index order.
>

I suppose so. Graph positions would be used to efficiently look up
generation number in the commit-graph, so Git uses both graph position
and generation number in the same process.

Although, since we store generation number in the anyway, we have no
further use for the graph position.

But I think Dr. Stolee would be the right person to answer this.

> > I felt the first two approaches were too unreadable.
> >
> 
> I think the 3rd approach is worth doing on its own, as it makes the code
> bit clearer... but I have not actually examined the commits in the pull
> request.
> 

Alright, will add this as well.

> > In the end, I compromised by using a 64-bit generation and a 32-bit
> > level in the initial commit and will restrict the ugly conversion to
> > reuse 64-bits in a focused patch later in the series.
> 
> I would have to examine the patch in question in more detail, but I am
> waiting for the series to be posted to Git mailing list (though I can be
> persuaded to do pre-review on GitHub pull request, as pre-release step).
> 
> > Regression when computing bloom filters
> > =======================================
> >
> > While wrestling with the problem above, I noticed that
> > `commit_gen_cmp()` lies in “commit-graph write” path but uses the
> > helpers. But the helper would always return `GENERATION_NUMBER_MAX`
> > as we found while moving generation number into a slab.
> >
> > Digging through git logs, I found [the patch][1], which introduced
> > `commit_gen_cmp()`. Sorting commits make the computing bloom filters
> > much, much faster. The helper returns the same value every time, which
> > makes sorting pointless and nullifies the performance boost. Fixing
> > this was easy, bypass the helper and access generation number.
> >
> > [1]: https://github.com/git/git/commit/3d11275505694ce4e5256516de1c5dd90e749303
> 
> Good catch!
> 
> Best,
> --
> Jakub Narębski



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux