On Thu, Apr 4, 2019 at 4:56 AM Christian Couder <christian.couder@xxxxxxxxx> wrote:
>
> Hi,
>
> On Thu, Apr 4, 2019 at 3:15 AM Matheus Tavares Bernardino
> <matheus.bernardino@xxxxxx> wrote:
> >
> > I've been studying the codebase and looking for older emails in the
> > ML that discussed what I want to propose as my GSoC project. In
> > particular, I found a thread about slow git commands on chromium, so
> > I reached out to them on chromium's ML to ask if it's still an
> > issue. I got the following answer:
> >
> > On Wed, Apr 3, 2019 at 1:41 PM Erik Chen <erikchen@xxxxxxxxxxxx> wrote:
> > > Yes, this is absolutely still a problem for Chrome. I filed some
> > > bugs for common operations that are slow for Chrome: git blame [1],
> > > git stash [2], git status [3]
> > > On Linux, blame is the only operation that is really problematic.
> > > On macOS and Windows ... it's hard to find a git operation that
> > > isn't slow. :(
>
> Nice investigation. About git status I wonder though if they have
> tried the possible optimizations, like untracked cache or
> core.fsmonitor.

I don't know if they did, but I suggested that they check
core.commitGraph, pack.useBitmaps and core.untrackedCache (which Duy
suggested to me in another thread).

> > I don't really know if threading would help stash and status, but I
> > think it could help blame. From the little I've read of blame's code
> > so far, my guess is that the priority queue used for the commits
> > could be an interface for a producer-consumer mechanism and, that
> > way, assign_blame's main loop could be done in parallel. And as we
> > can see at [4], that is 90% of the command's time. Does this make
> > sense?
>
> I can't really tell as I haven't studied this, but from the links in
> your email I think it kind of makes sense.
> Instead of doing assign_blame()'s main loop in parallel though, if my
> focus was only making git blame faster, I think I would first try to
> cache xdl_hash_record() results and then, if possible, to compute
> xdl_hash_record() in parallel, as it seems to be a big bottleneck and
> quite low-hanging fruit.

Hm, I see. But although it would take more effort to add threading at
assign_blame(), wouldn't it be better since more work could be done in
parallel? I think it could be implemented in the same fashion as git
grep does it.

> > But as Duy pointed out, if I recall correctly, for git blame to be
> > parallel, pack access and diff code would have to be thread-safe
> > first. And also, it seems, from what we've talked about earlier,
> > that this much wouldn't fit all together in a single GSoC. So, would
> > it be a nice GSoC proposal to try "making code used by blame
> > thread-safe", targeting a future parallelism on blame to be done
> > after GSoC?
>
> Yeah, I think it would be a nice proposal, even though it doesn't seem
> to be the most straightforward way to make git blame faster.
>
> Back in 2008 when we proposed a GSoC about creating a sequencer, it
> wasn't something that would easily fit in a GSoC, and in fact it
> didn't, but over the long run it has been very fruitful as the
> sequencer is now used by cherry-pick and rebase -i, and there are
> plans to use it even more. So unless people think it's not a good idea
> for some reason, which hasn't been the case yet, I am ok with a GSoC
> project like this.
>
> > And if so, could you please point out which files I should be
> > studying to write the planning for this proposal? (Unfortunately I
> > wasn't able to study pack access and diff code yet.
> > I got carried away looking for performance hotspots and now I'm a
> > bit behind schedule :(
>
> I don't think you need to study everything yet, and I think you
> already did a lot of studying, so I would suggest you first try to
> send soon a proposal with the information you have right now, and
> then, depending on the feedback you get and the time left (likely not
> much!!!), you might study some parts of the code a bit more later.

Thanks a lot, Christian. I'm writing my proposal and will try to send
it today.

> > Also, an implementation for fuzzy blame is being developed right
> > now [5], and Jeff (CC-ed) suggested recently another performance
> > improvement that could be done in blame [6]. So I would like to know
> > whether you think it is worth putting effort into trying to
> > parallelize it.
>
> What you would do seems compatible to me with the fuzzy blame effort
> and an effort to cache xdl_hash_record() results.
>
> Thanks,
> Christian.