On Tue, Mar 29, 2011 at 15:21, Sebastian Thiel <byronimo@xxxxxxxxxxxxxx> wrote: > I am currently working on a c++ implementation of the git core, which for > now includes reading and writing of loose objects, as well as reading and > verifying pack files. Have considered wrapping libgit2 with a C++ binding? Just curious. > Actually, this is not the first time I do this, as I > made my first experience in that matter with a pure python implementation of > the git core (https://github.com/gitpython-developers/gitdb). I think I saw this the other week... why this project vs. using Dulwich[1]? [1] http://samba.org/~jelmer/dulwich/ > This time > though, I wanted to see whether I can achieve better performance, and how I > can make git more suitable to handle big files. A noble goal... > When profiling my initial uncached version of my pack reading > implementation, I noticed that most of the time was actually spent in zlibs > inflate method. Yes. The profile is somewhere in this ballpark if Git is doing rev-list --objects, aka the "Counting" phase of a git clone: - 30% in zlib inflate() - 30% in object map lookup/insertion - 30% misc. elsewhere > The major difference turned out to be the way the cache works. Git has a > small delta cache with only 256 [hardcoded] cache entries and a default > memory limit of 16mb. There it stores fully decompressed objects. It maps > objects to entries by hashing their pack offsets into the available range of > entries. Right, a very simple cache. FWIW, I've tried to use more complex cache rules inside of JGit, to no avail. A more complex cache implementation (e.g. one that supports a limited number of collisions in the hash buckets and uses a full LRU) runs slow enough relative to this simple cache that performance actually gets worse. > When the pack is accessed sequentially, the cache will be filled > with related uncompressed objects, which can in turn reduce the time > required to apply the next delta by a huge amount, as only a single delta > has to be applied instead of a possibly long delta chain. Yes... mostly. > As git appears to > pack deltas of related objects close to each other (regarding their offset > in the pack), This isn't true. Git packs object by time, *not* delta ordering. However objects are delta compressed by commonality on tree path *and* time. An example repository I like to play with is the linux-2.6 repository; in that repository the pack is around 370 MiB. If you break the pack up into 1 MiB slices by offset, you will find that an object at the end of a 50 deep delta chain touches about 50 unique 1 MiB slices in order to build itself up. :-) This is caused by things being clustered by both time and path. If a path is heavily modified within a short time period, sure, those will be clustered together in the file. But if a path is rarely modified, its objects will be distributed throughout the file. > the cache will be hit quite often automatically. The hit rate happens to work well because most uses access less than 256 distinct similar things at once. I forget what the stats are for the linux-2.6 repository, but I think there are less than 256 unique directories. As Git walks through the history sequentially from most-recent to least-recent, its priming the cache with objects that have very short delta chains and are thus more likely to be used as delta bases for objects later in the file. Since each directory or file acts as a delta base for someone else later, its likely to be in this cache as the reader walks backwards through time. As bases switch, the cache is updated at a relatively low penalty, because the new base was itself recently accessed using the base that is already in the cache. The simple % 256 rule the cache uses is effective because objects are pretty randomly allocated as far as offsets go in the file. We just damn lucky. :-) > This cache implementation is clearly suitable for sequential access. Yes. > Both cache types are optimized for different kinds of access modes, and both > are required to efficiently deal with everything git usually has to do. > Hence I changed my cache to support both modes, and rerun the pack > verification test. > > The result was better than expected, as the my implementation now takes the > lead by a tiny amount (25.3s vs. 26.0s) with a 16mb cache size. On my way to This isn't a very significant speed difference given the differences in implementation. We're not really looking to shave 3% off the running time for operation X, we're looking to shave >10%. > make it even faster, I experimented with different cache sizes, amounts of > entries and of course different packs, which ranged from 20mb to 600mb, > which helped me fine tune the relations of these variables. > In the end, with a cache of 27mb, my implementation took 20.6s, whereas the OK, this is pretty significant. Saving 21% of the running time, at the expense of an extra 11M of working set. But the verify pack workload is pretty useless, nobody accesses data by SHA-1 order. Most uses of Git are going backwards through time. log and blame are the two notable things that happen *a lot* and that users complain about being slow. These also aren't random accesses, there is a definite pattern and the pattern can be exploited. I'm really only interested in improving these two patterns. As far as verify-pack improving, Junio improved it by switching to use index-pack with the new --verify flag. There really isn't a faster way to scan through a pack than the way index-pack does it. So, all I'm trying to say is, verify-pack isn't the right thing to target when you are looking at "how do I make Git faster". > -- Conclusions -- > * delta cache needs to deal with random and sequential access. I'm not sure where the random access case is coming from. Who is doing random access except verify-pack? > * current implementation deals with sequential access only, which is only > suitable for pack verification, Not true. First, pack verification is horrifically random, since its by SHA-1 order and not sequential order. Second, every other use of the pack data is generally sequential in time, because every other use is starting from the current revisions as found from the refs and walking backwards in time, which is forwards sequentially in the pack. -- Shawn. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html