Hi,

Vicent Marti wrote:

> Merging libgit2 into upstream Git is a scary as fuck task. Somebody
> put it up on the Wiki ideas page, but that was not me

Cc-ing Ram (who added it), in case he has anything to add.

> -- I'm
> personally doubtful of anybody succeeding on doing that project during
> the SoC,

I agree there --- it is a huge task. But maybe it could inspire
someone to come up with a smaller task.

One long-term goal might be to get libgit2 and core git to share
revision walking APIs; a baby step towards that would be a
proof-of-concept patch to share object access APIs. If someone wants
to work on this, I'd be glad to talk over what would be needed to make
a realistic proposal.

> so I have very little interest on mentoring the task.

That's okay, of course. What's probably important for people
considering this project is: would you be willing to answer questions
and consider patches from a person working on this? That is, do you
consider the goal even worthwhile?

I am probably not the best person to mentor this but if no one else
wants to then I would be interested.

> Here's what's going on: The Git code base is hairy and not that well
> documented, so you're gonna need to study that quite a bit. I like to
> think that the libgit2 code base is not hairy, and is pretty well
> documented (I'm an optimistic guy), but you're still going to need
> quite a bit of research to understand the whole architecture before
> you can actually merge anything into Git.

Like the Linux kernel, the git codebase does not have many comments
alongside the code, it is true. But it is actually incredibly well
documented in my experience. The best documentation is in the
history. In addition to that, there is some API documentation in
Documentation/technical.

A good place to start is the initial commit e83c516 (Initial revision
of "git", the information manager from hell, 2005-04-07). The
architecture described therein is very simple and still exists today
with few changes.

To explain something that has come later, the easiest way is to learn
how the author explained it when the change was made.

Let me give an example. Suppose I am wondering how git decides what
commits to show when I say "git log ^topic1 topic2". In particular, I
wonder what the performance characteristics of that operation are and
how it is able to print the first result without spending O(depth of
history) to traverse all the ancestors of topic1 going back to the
beginning of time.

First step: what does "git log" do with that "^topic1 topic2"? Wait,
where is the "log" command defined in the first place?

    $ git grep -e '"log"'
    [...]
    git.c: { "log", cmd_log, RUN_SETUP },
    [...]

Ok, it's the cmd_log function. Looking at the definition of that
function, it seems that it does

    init_revisions(&rev, prefix);
    rev.always_show_header = 1;
    memset(&opt, 0, sizeof(opt));
    opt.def = "HEAD";
    cmd_log_init(argc, argv, prefix, &rev, &opt);
    return cmd_log_walk(&rev);

    $ git grep -e init_revisions -- Documentation
    Documentation/technical/api-revision-walking.txt:`init_revisions`::

The revision walking API is explained in the api-revision-walking.txt
document. From this we learn that responsibility for the revision walk
is divided between prepare_revision_walk and get_revision, defined in
revision.c.

prepare_revision_walk seems to use functions "handle_commit" and
"commit_list_insert_by_date". What do they do?

    $ git log -p -Shandle_commit -- revision.c
    commit cd2bdc5309461034e5cc58e1d3e87535ed9e093b
    Author: Linus Torvalds <torvalds@xxxxxxxx>
    Date:   Fri Apr 14 16:52:13 2006 -0700

        Common option parsing for "git log --diff" and friends

        This basically does a few things that are sadly somewhat
        interdependent,
    [...]
        Now, that was the easy and straightforward part.

        The slightly more involved part is that some of the programs
        that want to use the new-and-improved rev_info parsing don't
        actually want _commits_, they may want tree'ish arguments
        instead.
        That meant that I had to change setup_revision() to parse the
        arguments not into the "revs->commits" list, but into the
        "revs->pending_objects" list. Then, when we do
        "prepare_revision_walk()", we walk that list, and create the
        sorted commit list from there.

Okay: so in revision walking:

 - first (in setup_revisions), git pushes the ^topic1 and topic2
   commits onto a list called "pending_objects";

 - next, in prepare_revision_walk, it walks through the pending
   objects list and inserts them in a commit list, sorted by date;

and next?

    $ git log -Sget_revision -- revision.c
    [...]
    commit a4a88b2bab3b6fb0b30f63418701f42388e0fe0a
    Author: Linus Torvalds <torvalds@xxxxxxxx>
    Date:   Tue Feb 28 11:24:00 2006 -0800

        git-rev-list libification: rev-list walking

        This actually moves the "meat" of the revision walking from
        rev-list.c to the new library code in revision.h. It introduces
        the new functions

            void prepare_revision_walk(struct rev_info *revs);
            struct commit *get_revision(struct rev_info *revs);

        to prepare and then walk the revisions that we have.

        Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
        Signed-off-by: Junio C Hamano <junkio@xxxxxxx>

Well, that's actually not so helpful. I mean, it tells us that
get_revision is what takes care of the revision walk, but it doesn't
tell us what the revision walk consists of.

So here we need another trick to get at the meat of the matter --- we
need to know where this "revision walking from rev-list.c" came from.
Ah:

    $ git log -- rev-list.c
    [...]
    commit 64745109c41a5c4a66b9e3df6bca2fd4abf60d48
    Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxx>
    Date:   Sat Apr 23 19:04:40 2005 -0700

        Add "rev-list" program that uses the new time-based commit listing.

        This is probably what you'd want to see for "git log".

And the answer is there in the patch for a commit that comes after that
(8906300, git-rev-list: use proper lazy reachability analysis,
2005-05-30).

Heh, probably I didn't choose the best example.
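
To make the "sorted by date" and "lazy" parts of the walk concrete,
here is a rough C sketch of how a date-ordered list can answer
"^topic1 topic2" without visiting the whole history. It is a toy model
and not git's code: struct commit, struct walk, walk_push and walk_next
are hypothetical names invented for illustration, the queue has a fixed
size, and it assumes every commit is strictly newer than its parents,
which real histories do not guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS   2
#define MAX_QUEUE     64
#define SEEN          1u  /* commit has been queued at some point */
#define UNINTERESTING 2u  /* reachable from a "^rev" argument */

/* Hypothetical in-memory commit, much simpler than git's struct commit. */
struct commit {
    const char *name;
    long date;
    struct commit *parents[MAX_PARENTS];  /* unused slots are NULL */
    unsigned flags;
};

/* Pending commits, kept sorted by date with the newest at the end. */
struct walk {
    struct commit *queue[MAX_QUEUE];
    size_t len;
};

/* Add flags to c, then queue it by date unless it was queued before. */
void walk_push(struct walk *w, struct commit *c, unsigned flags)
{
    size_t i;

    c->flags |= flags;
    if (c->flags & SEEN)
        return;
    c->flags |= SEEN;
    assert(w->len < MAX_QUEUE);
    for (i = w->len++; i > 0 && w->queue[i - 1]->date > c->date; i--)
        w->queue[i] = w->queue[i - 1];
    w->queue[i] = c;
}

/* True while at least one queued commit could still be shown. */
static int still_interesting(const struct walk *w)
{
    size_t i;

    for (i = 0; i < w->len; i++)
        if (!(w->queue[i]->flags & UNINTERESTING))
            return 1;
    return 0;
}

/* Return the next commit to show, or NULL once the walk is finished. */
struct commit *walk_next(struct walk *w)
{
    while (still_interesting(w)) {
        struct commit *c = w->queue[--w->len];  /* newest pending commit */
        unsigned inherit = c->flags & UNINTERESTING;
        size_t i;

        /* Parents are only looked at now, when their child is popped. */
        for (i = 0; i < MAX_PARENTS; i++)
            if (c->parents[i])
                walk_push(w, c->parents[i], inherit);
        if (!inherit)
            return c;
    }
    return NULL;
}
```

In this model, seeding the walk with walk_push(&w, topic1,
UNINTERESTING) and walk_push(&w, topic2, 0) and then calling walk_next
until it returns NULL plays the role of "git log ^topic1 topic2". The
first result comes back after a single pop, and the loop stops as soon
as only uninteresting commits remain queued, so older ancestors are
never touched.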
:) A short article about this in Documentation/technical certainly
wouldn't be a bad thing.

In addition to "git log -S" as used above, I tend to find "git blame
-L" helpful FWIW. And people on the list can be helpful, too.

> (libgit2 is reentrant and mostly threadsafe, so there's quite the
> architecture mismatch there),

Could you expand on that a little? I understand that a lot of git code
wouldn't be usable for libgit2 as-is and that there is going to be some
overhead from, say, using malloc to initialize buffers instead of
relying on static ones. But does that deserve to be called an
architecture mismatch? Would that make it hard to reuse libgit2 code
within git?

I'd be very interested in learning about more substantial differences
in approach. Probably the two codebases could learn a lot from each
other's design.

> Overall, you'd need balls of steel

Here I agree.

> HOWEVER. If you want to do something libgit2-related for the SoC
> (which would be awesome), there's still two options:
>
> a) Help us make the library more awesome by implementing new features!
> This task is the opposite the previous one; it's like full of unicorns
> and rainbows. You can choose one (or more) features we are missing,
> and see how to implement them in libgit2 while making them reentrant,
> threadsafe AND faster. It's not easy, but it's fucking cool. And you
> get to do a lot of micro-optimization if you're into that.

Note that if this is your kind of thing, you might consider sending
"libification patches" to modify the code in git while at it. That
means free code review and free bugfixes from then on if your changes
are accepted.

> b) Write a minimal Git client using libgit2. Peff keeps bringing this
> up and I think it's a bangin' good idea. Write something small and
> 100% self contained in a C executable that runs everywhere with 0
> dependencies -- don't aim for full feature completion, just the basic
> stuff to interoperate with a Git repository.

I agree that this would be very neat, too.

> So, yeah. That's pretty much my libgit2-related advice for the SoC.

Thanks again, Vicent, for these very useful explanations.

> Best of luck with your application process with whatever project you
> decide,
> Vicent

Seconded. :)

Hope that helps,
Jonathan