Michael Witten <mfwitten@xxxxxxxxx> wrote: > On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@xxxxxxxxx> wrote: > > I hope that JGit developers can > > tell us whether using higher level language affects performance, how > > much, and what features of higher-level language are causing decrease > > in performance. > > Java is definitely higher than C, but you can do some pretty low-level > operations on bits and bytes and the like, not to mention the presence > of a JIT. But its still costly compared to C. > My point: I don't think that Java can tell us anything special in this regard. Sure it can. Peff I think made a good point here, that we rely on a lot of small tweaks in the C git code to get *really* good performance. 5% here, 10% there, and suddenly you are 60% faster than you were before. Nico, Linus, Junio, they have all spent some time over the past 3 or 4 years trying to tune various parts of Git to just flat out run fast. Higher level languages hide enough of the machine that we can't make all of these optimizations. JGit struggles with not having mmap(), or when you do use Java NIO MappedByteBuffer, we still have to copy to a temporary byte[] in order to do any real processing. C Git avoids that copy. Sure, other higher level langauges may offer a better mmap facility, but they also tend to offer garbage collection and most try to tie the mmap management into the GC "for safety and ease of use". JGit struggles with not having unsigned types in Java. There are many locations in JGit where we really need "unsigned int32_t" or "unsigned long" (largest machine word available) or "unsigned char" but these types just don't exist in Java. Converting a byte up to an int just to treat it as an unsigned requires an extra " & 0xFF" operation to remove the sign extension. JGit struggles with not having an efficient way to represent a SHA-1. C can just say "unsigned char[20]" and have it inline into the container's memory allocation. A byte[20] in Java will cost an *additional* 16 bytes of memory, and be slower to access because the bytes themselves are in a different area of memory from the container object. We try to work around it by converting from a byte[20] to 5 ints, but that costs us machine instructions. C Git takes for granted that memcpy(a, b, 20) is dirt cheap when doing a copy from an inflated tree into a struct object. JGit has to pay a huge penalty to copy that 20 byte region out into 5 ints, because later on, those 5 ints are cheaper. Other higher level languages also lack the ability to mark a type unsigned. Or face similiar penalties with storing a 20 byte binary region. Native Java collection types have been a snare for us in JGit. We've used java.util.* types when they seem to be handy and already solve the data structure problem at hand, but they tend to preform a lot worse than writing a specialized data structure. For example, we have ObjectIdSubclassMap for what should be Map<ObjectId,Object>. Only it requires that the Object type you use as the "value" entry in the map extend from ObjectId, as the instance serves as both key *and* value. But it screams when compared to HashMap<ObjectId,Object>. (For those who don't know, ObjectId is JGit's "unsigned char[20]" for a SHA-1.) Just a day or so ago I wrote LongMap, a faster HashMap<Long,Object>, for hashing objects by indexes in a pack file. Again, the boxing costs in Java to convert a "long" (largest integer type) into an Object that the standard HashMap type would accept was rather high. Right now, JGit is still paying dearly when it comes to ripping apart a commit or a tree object to follow the object links. Or when invoking inflate(). We spend a lot more time doing this sort of work than C git does, and yet we're trying to be as close to the machine as we can go by using byte[] whenever possible, by avoiding copying whenever possible, and avoiding memory allocation when possible. Notably, `rev-list --objects --all` takes about 2x as long in JGit as it does in C Git on a project like the linux kernel, and `index-pack` for the full ~270M pack file takes about 2x as long. Both parts of JGit are about as good as I know how to make them, but we're really at the mercy of the JIT, and changes in the JIT can cause us to perform worse (or better) than before. Unlike in C Git where Linus has done assembler dumps of sections of code and tried to determine better approaches. :-) So. Yes, its practical to build Git in a higher level language, but you just can't get the same performance, or tight memory utilization, that C Git gets. That's what that higher level language abstraction costs you. But, JGit performs reasonably well; well enough that we use internally at Google as a git server. -- Shawn. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html