Re: Histogram diff, libgit2 enhancement, libgit2 => git merge (GSOC)

Jonathan Nieder <jrnieder@xxxxxxxxx> · Sun, 20 Mar 2011 20:27:08 -0500

Hi,

Vicent Marti wrote:

> Merging libgit2 into upstream Git is a scary as fuck task. Somebody
> put it up on the Wiki ideas page, but that was not me

Cc-ing Ram (who added it), in case he has anything to add.

> -- I'm
> personally doubtful of anybody succeeding on doing that project during
> the SoC,

I agree there --- it is a huge task.  But maybe it could inspire
someone to come up with a smaller task.  One long-term goal might be
to get libgit2 and core git to share revision walking APIs; a baby
step towards that would be a proof-of-concept patch to share object
access APIs.

If someone wants to work on this, I'd be glad to talk over what would
be needed to make a realistic proposal.

> so I have very little interest on mentoring the task.

That's okay, of course.  What's probably important for people
considering this project is: would you be willing to answer questions
and consider patches from a person working on this?  That is, do you
consider the goal even worthwhile?

I am probably not the best person to mentor this, but if no one else
wants to, I would be interested.

> Here's what's going on: The Git code base is hairy and not that well
> documented, so you're gonna need to study that quite a bit. I like to
> think that the libgit2 code base is not hairy, and is pretty well
> documented (I'm an optimistic guy), but you're still going to need
> quite a bit of research to understand the whole architecture before
> you can actually merge anything into Git.

It is true that, like the Linux kernel, the git codebase does not have
many comments alongside the code.  But in my experience it is actually
incredibly well documented.  The best documentation is in the
history.  In addition to that, there is some API documentation in
Documentation/technical.

A good place to start is the initial commit e83c516 (Initial revision
of "git", the information manager from hell, 2005-04-07).  The
architecture described therein is very simple and still exists today
with few changes.

To understand something that came later, the easiest way is to read
how the author explained it when the change was made.

Let me give an example.  Suppose I am wondering how git decides what
commits to show when I say "git log ^topic1 topic2".  In particular, I
wonder what the performance characteristics of that operation are and
how it is able to print the first result without spending O(depth of
history) to traverse all the ancestors of topic1 going back to the
beginning of time.

First step: what does "git log" do with that "^topic1 topic2"?  Wait,
where is the "log" command defined in the first place?

 $ git grep -e '"log"'
 [...]
 git.c:          { "log", cmd_log, RUN_SETUP },
 [...]

Ok, it's the cmd_log function.  Looking at the definition of that
function, it seems that it does

	init_revisions(&rev, prefix);
	rev.always_show_header = 1;
	memset(&opt, 0, sizeof(opt));
	opt.def = "HEAD";
	cmd_log_init(argc, argv, prefix, &rev, &opt);
	return cmd_log_walk(&rev);

 $ git grep -e init_revisions -- Documentation
 Documentation/technical/api-revision-walking.txt:`init_revisions`::

The revision walking API is explained in the api-revision-walking.txt
document.  From this we learn that responsibility for the revision
walk is divided between prepare_revision_walk and get_revision,
defined in revision.c.

prepare_revision_walk seems to use functions "handle_commit" and
"commit_list_insert_by_date".  What do they do?

 $ git log -p -Shandle_commit -- revision.c
 commit cd2bdc5309461034e5cc58e1d3e87535ed9e093b
 Author: Linus Torvalds <torvalds@xxxxxxxx>
 Date:   Fri Apr 14 16:52:13 2006 -0700

     Common option parsing for "git log --diff" and friends

     This basically does a few things that are sadly somewhat interdependent,
[...]
     Now, that was the easy and straightforward part.

     The slightly more involved part is that some of the programs that want to
     use the new-and-improved rev_info parsing don't actually want _commits_,
     they may want tree'ish arguments instead. That meant that I had to change
     setup_revision() to parse the arguments not into the "revs->commits" list,
     but into the "revs->pending_objects" list.

     Then, when we do "prepare_revision_walk()", we walk that list, and create
     the sorted commit list from there.

Okay: so in revision walking:

 - first (in setup_revisions), git pushes the ^topic1 and topic2
   commits onto a list called "pending_objects";
 - next, in prepare_revision_walk, it walks through the pending
   objects list and inserts them in a commit list, sorted by date;

and next?

 $ git log -Sget_revision -- revision.c
[...]
 commit a4a88b2bab3b6fb0b30f63418701f42388e0fe0a
 Author: Linus Torvalds <torvalds@xxxxxxxx>
 Date:   Tue Feb 28 11:24:00 2006 -0800

     git-rev-list libification: rev-list walking

     This actually moves the "meat" of the revision walking from rev-list.c
     to the new library code in revision.h. It introduces the new functions

         void prepare_revision_walk(struct rev_info *revs);
         struct commit *get_revision(struct rev_info *revs);

     to prepare and then walk the revisions that we have.

     Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
     Signed-off-by: Junio C Hamano <junkio@xxxxxxx>

Well, that's actually not so helpful.  I mean, it tells us that
get_revision is what takes care of the revision walk, but it doesn't
tell us what the revision walk consists of.

So here we need another trick to get at the meat of the matter ---
we need to know where this "revision walking from rev-list.c" came
from.  Ah:

 $ git log -- rev-list.c
[...]
 commit 64745109c41a5c4a66b9e3df6bca2fd4abf60d48
 Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxx>
 Date:   Sat Apr 23 19:04:40 2005 -0700

     Add "rev-list" program that uses the new time-based commit listing.

     This is probably what you'd want to see for "git log".

And the answer is in the patch for a later commit (8906300,
git-rev-list: use proper lazy reachability analysis, 2005-05-30).

Heh, probably I didn't choose the best example. :)  A short article
about this in Documentation/technical certainly wouldn't be a bad
thing.

In addition to "git log -S" as used above, I tend to find "git blame -L"
helpful FWIW.  And people on the list can be helpful, too.

> (libgit2 is reentrant and mostly threadsafe, so there's quite the
> architecture mismatch there),

Could you expand on that a little?  I understand that a lot of git
code wouldn't be usable for libgit2 as-is and that there is going to
be some overhead from, say, using malloc to initialize buffers instead
of relying on static ones.  But does that deserve to be called an
architecture mismatch?  Would that make it hard to reuse libgit2 code
within git?

I'd be very interested in learning about more substantial differences
in approach.  Probably the two codebases could learn a lot from each
other's design.

> Overall, you'd need balls of steel

Here I agree.

> HOWEVER. If you want to do something libgit2-related for the SoC
> (which would be awesome), there's still two options:
>
> a) Help us make the library more awesome by implementing new features!
> This task is the opposite the previous one; it's like full of unicorns
> and rainbows. You can choose one (or more) features we are missing,
> and see how to implement them in libgit2 while making them reentrant,
> threadsafe AND faster. It's not easy, but it's fucking cool. And you
> get to do a lot of micro-optimization if you're into that.

Note that if this is your kind of thing, you might consider sending
"libification patches" to modify the code in git while you are at it.
That means free code review and free bugfixes from then on if your
changes are accepted.

> b) Write a minimal Git client using libgit2. Peff keeps bringing this
> up and I think it's a bangin' good idea. Write something small and
> 100% self contained in a C executable that runs everywhere with 0
> dependencies -- don't aim for full feature completion, just the basic
> stuff to interoperate with a Git repository.

I agree that this would be very neat, too.

> So, yeah. That's pretty much my libgit2-related advice for the SoC.

Thanks again, Vicent, for these very useful explanations.

> Best of luck with your application process with whatever project you decide,
> Vicent

Seconded. :)

Hope that helps,
Jonathan