On Sat, Jul 16, 2011 at 03:24, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> Just out of curiosity and because I happen to know we have Googlers
> here. If it's not confidential, are there any changes in git to make
> it work with Google Code? I am particularly interested in whether
> Google modifies git to use bigtable

A major milestone in Git was adding smart HTTP. If you watch the talk Sverre linked to, you will learn that Google is based heavily on HTTP. A fundamental issue at the time Hg support was added to Google Code was that Git didn't really work well over HTTP. Adding smart HTTP in Git 1.6.6 made it much more realistic for Google to support Git on Project Hosting.

I added smart HTTP support for kernel.org so their users behind firewalls could still use an efficient Git protocol to fetch revisions from kernel.org for projects hosted there. It's a nice bonus that this work also made Git on Google Code more realistic for Google.

We are trying to get the engineer responsible for making Git on Google Code possible to give a recorded tech talk like the one Sverre linked to. I don't want to steal his thunder, but I can say the Git on Google Code work is not based on C Git or JGit. :-)

> (or cassandra, I remember Shawn
> had a prototype).

That was an unrelated project, and one I deem a failure... quite unlike Git on Google Code. :-)

For some background: at the GitTogether in Oct. 2010 I showed a demo of JGit using the Apache Cassandra database as an object / reference store. This prototype didn't really scale well. Even though I demoed the linux-2.6 repository being cloned through a JGit daemon using Cassandra as the backing store, it was slow and used too much CPU and memory to be useful in any context beyond a "Look, I can do this!" demo. I managed to open source this work, and it may still be lying around somewhere, but I basically threw it out the window and said "that isn't good, and I can't believe I put my name on it!". (And for the record, I was not the first to try this; Scott Chacon at GitHub tried something similar first and demoed it at the GitTogether in 2009.)

In late Jan/Feb 2011 I released a series of patches for JGit that added what I called "DHT" (distributed hash table) support. These patches are now part of the JGit project. It's different from the original Cassandra prototype. With this work, JGit tries to treat the DHT as though it were a virtual memory system. Relatively standard pack files are segmented into ~1 MiB chunks, then stored into the DHT with row keys based on the SHA-1 hash of the content of the "pack chunk". The bet here is that the locality of data in a pack file is quite good, so loading a ~1 MiB chunk of commits should return a number of related commits, amortizing the round-trip time to the database. This was meant to resolve one of the latency problems I saw with the Cassandra prototype, which stored one commit per row and had awful performance during a major revision traversal like the one a clone performs.
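To make that chunk layout a bit more concrete, here is a minimal, self-contained sketch of the idea. It is not the actual JGit DHT code; the PackChunker and Chunk names, the split() method, and the fixed CHUNK_SIZE are invented for illustration. The sketch just splits a pack stream into roughly 1 MiB pieces and derives each piece's row key from the SHA-1 of its bytes.

import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of the DHT chunking idea; not the real JGit DHT code. */
public class PackChunker {
    /** Target chunk size: roughly 1 MiB of pack data per DHT row. */
    private static final int CHUNK_SIZE = 1024 * 1024;

    /** A single chunk: the row key (SHA-1 of the data) plus the data itself. */
    public static final class Chunk {
        final String rowKey;
        final byte[] data;

        Chunk(String rowKey, byte[] data) {
            this.rowKey = rowKey;
            this.data = data;
        }
    }

    /** Split a pack stream into ~1 MiB chunks keyed by the SHA-1 of their content. */
    public static List<Chunk> split(InputStream packStream)
            throws IOException, NoSuchAlgorithmException {
        List<Chunk> chunks = new ArrayList<>();
        byte[] buf = new byte[CHUNK_SIZE];
        int filled;
        while ((filled = fill(packStream, buf)) > 0) {
            byte[] data = new byte[filled];
            System.arraycopy(buf, 0, data, 0, filled);

            // The row key is the SHA-1 of the chunk bytes, so identical chunks
            // de-duplicate naturally and keys spread evenly across the DHT.
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            chunks.add(new Chunk(toHex(sha1.digest(data)), data));
        }
        return chunks;
    }

    /** Read until the buffer is full or the stream ends; return bytes read. */
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0)
                break;
            off += n;
        }
        return off;
    }

    private static String toHex(byte[] raw) {
        StringBuilder hex = new StringBuilder(raw.length * 2);
        for (byte b : raw)
            hex.append(String.format("%02x", b));
        return hex.toString();
    }
}

Keying chunks by the SHA-1 of their content means identical chunks de-duplicate for free and the keys distribute evenly across the DHT. A real store would presumably also want to cut chunks on object boundaries rather than at arbitrary byte offsets, but that detail is omitted from this sketch.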
The JGit DHT work led me to discover that pack locality is not as good as we think it is. It's good, but it can be better. I added some patches to JGit's PackWriter to reorder objects into an order that gave better data locality. After Junio and I started sharing an office, I began nagging him about this locality problem in Git pack files... and that nagging led to a series of patches Junio posted about a week back to improve pack-objects.c. The improvement is small on local disk: it reduces some minor page faults, but there isn't much difference in overall running time. Over higher-latency filesystems, however, like an NFS server in another city, it should help reading.

Just recently I posted a message to the jgit-dev mailing list saying that I now think JGit DHT isn't a viable solution either, and that I am likely to discard it in the future. Its implementation is very complicated, and it just doesn't perform as well as I had hoped. FWIW, this work was not for Google; it was for open source Git hosting sites like source.android.com, eclipse.org, KDE, etc., which need to manage a large number of Git repositories and want hot-failover and load-balancing to reduce downtime caused by hardware failures. Unfortunately it hasn't been panning out, because the performance loss is large compared to the small administrative improvements it might bring, not to mention the additional complexity of running a clustered database versus just a bunch of Git repositories in a directory.

I can tell you that none of this is what Git for Google Code does. As for how Git on Google Code is implemented... you'll just have to wait for the tech talk from the engineer responsible. I can say it wasn't me, and it wasn't Junio. I am too busy with JGit and Gerrit Code Review, and Junio is too busy being Git maintainer to work on a major new feature like this. :-)

-- 
Shawn.