Hi,
What follows is a summary of how I approached the git cache in order to
write my own improved version. The conclusions can be found further
down, in case you want to skip all the extra words for now.
I am currently working on a C++ implementation of the git core, which
for now includes reading and writing of loose objects, as well as
reading and verifying pack files. This is not the first time I have
done this, as I gained my first experience in this area with a pure
python implementation of the git core
(https://github.com/gitpython-developers/gitdb). This time, though, I
wanted to see whether I could achieve better performance, and how git
could be made more suitable for handling big files.
When profiling the initial, uncached version of my pack reading
implementation, I noticed that most of the time was actually spent in
zlib's inflate method. Apparently a cache was a good idea - git
already has one. Without looking at its implementation, I wrote my own
naive one right away, which stored only base objects and inflated
deltas. Interestingly, this alone could double the performance of my
test case, which simply streams the data of all objects contained in
the pack, sha by sha, resembling a random access pattern.
To compare my cache with git's, I implemented pack verification, which
basically generates the SHA-1 of every uncompressed object in the pack
and runs a crc32 over each compressed stream. As the objects are
streamed ordered by offset, the access pattern can be described as
sequential.
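Just to make concrete what the per-object check amounts to, here is a
minimal sketch (my own simplification, assuming zlib for the crc32 and
OpenSSL for the SHA-1; verify_object and the header handling are only
illustrative, neither git's nor my actual code):

  #include <openssl/sha.h>
  #include <zlib.h>
  #include <cstdint>
  #include <cstdio>
  #include <cstring>
  #include <string>
  #include <vector>

  // compressed: the deflated bytes exactly as stored in the pack
  // content:    the fully decompressed (and delta-resolved) object data
  // type:       the object type string, e.g. "blob", "tree", "commit"
  bool verify_object(const std::vector<unsigned char>& compressed,
                     const std::vector<unsigned char>& content,
                     const std::string& type,
                     const unsigned char expected_sha1[SHA_DIGEST_LENGTH],
                     uint32_t expected_crc32)
  {
      // crc32 over the compressed stream, as stored in the pack index
      uLong crc = crc32(0L, Z_NULL, 0);
      crc = crc32(crc, compressed.data(), (uInt)compressed.size());

      // sha1 over the canonical "<type> <size>\0" header and the content
      char header[64];
      int header_len = std::snprintf(header, sizeof(header), "%s %zu",
                                     type.c_str(), content.size()) + 1;
      SHA_CTX ctx;
      SHA1_Init(&ctx);
      SHA1_Update(&ctx, header, header_len);
      SHA1_Update(&ctx, content.data(), content.size());
      unsigned char sha1[SHA_DIGEST_LENGTH];
      SHA1_Final(sha1, &ctx);

      return crc == expected_crc32 &&
             std::memcmp(sha1, expected_sha1, SHA_DIGEST_LENGTH) == 0;
  }
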
It came as no surprise that git would verify my test pack (the
aggressively packed git source repository with 137k objects in a
single 27mb pack) considerably faster, by about 25%. After some
profiling and optimizations, I could narrow the gap to about 15%.
Considering that my cache, which only got faster in the course of
these optimizations, could speed up random access by a factor of about
2.5, it was hard to understand why I couldn't reach git's performance.
The major difference turned out to be the way the cache works. Git has a
small delta cache with only 256 [hardcoded] cache entries and a default
memory limit of 16mb. There it stores fully decompressed objects. It
maps objects to entries by hashing their pack offsets into the available
range of entries. When the pack is accessed sequentially, the cache will
be filled with related uncompressed objects, which can in turn reduce
the time required to apply the next delta by a huge amount, as only a
single delta has to be applied instead of a possibly long delta chain.
As git appears to pack deltas of related objects close to each other
(regarding their offset in the pack), the cache will be hit quite often
automatically. As the number of entries is small, and as the entries
are connected by a doubly linked list, reclaiming memory is rather
efficient: it evicts the least recently used objects first, which are
unlikely to ever be needed again. Eviction doesn't have to run very
often either, as most entries are simply overwritten with new data on
hash collisions.
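Stripped of the details, I understand the structure roughly like this
(my own naming and simplification, not git's actual code):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct CacheEntry {
      uint64_t offset = 0;       // pack offset of the cached object
      std::vector<char> data;    // fully decompressed object data
      bool valid = false;
  };

  // A fixed number of slots, indexed by hashing the pack offset,
  // storing whole decompressed objects under a global memory limit.
  class SequentialDeltaCache {
      static const size_t NUM_ENTRIES = 256;  // hardcoded, like in git
      size_t memory_limit;
      size_t memory_used = 0;
      std::vector<CacheEntry> entries;

      size_t slot(uint64_t offset) const {
          return (size_t)(offset >> 2) % NUM_ENTRIES;  // offset hash
      }

  public:
      explicit SequentialDeltaCache(size_t limit = 16 * 1024 * 1024)
          : memory_limit(limit), entries(NUM_ENTRIES) {}

      // returns the cached data, or nullptr on a miss
      const std::vector<char>* lookup(uint64_t offset) {
          CacheEntry& e = entries[slot(offset)];
          return (e.valid && e.offset == offset) ? &e.data : nullptr;
      }

      void store(uint64_t offset, std::vector<char> data) {
          CacheEntry& e = entries[slot(offset)];
          if (e.valid)
              memory_used -= e.data.size();  // collision: overwrite
          memory_used += data.size();
          e.offset = offset;
          e.data = std::move(data);
          e.valid = true;
          // git additionally threads the entries onto a doubly linked
          // list and evicts the least recently used ones while
          // memory_used > memory_limit
      }
  };
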
This cache implementation is clearly suitable for sequential access.
My cache was optimized for random access, hence it stores only base
objects and uncompressed delta streams, using many more entries to
achieve a good cache hit ratio. The sequential access cache, on the
other hand, gains its performance from storing full objects, which
fills the cache memory up much faster, so having a lot of cache
entries makes no sense there.
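The reason storing only inflated deltas is so cheap is that turning a
cached delta into the full object merely requires applying that single
delta to its base. That step boils down to the following (my own
sketch of the documented pack delta format, error handling kept to a
minimum):

  #include <cstdint>
  #include <stdexcept>
  #include <vector>

  // Reads one of the little-endian, 7-bits-per-byte variable length
  // sizes found at the start of a delta stream.
  static uint64_t read_size(const unsigned char*& p,
                            const unsigned char* end) {
      uint64_t value = 0;
      int shift = 0;
      unsigned char byte;
      do {
          if (p == end) throw std::runtime_error("truncated delta");
          byte = *p++;
          value |= uint64_t(byte & 0x7f) << shift;
          shift += 7;
      } while (byte & 0x80);
      return value;
  }

  std::vector<char> apply_delta(const std::vector<char>& base,
                                const std::vector<char>& delta) {
      const unsigned char* p =
          reinterpret_cast<const unsigned char*>(delta.data());
      const unsigned char* end = p + delta.size();

      uint64_t base_size = read_size(p, end);
      uint64_t result_size = read_size(p, end);
      if (base_size != base.size())
          throw std::runtime_error("base size mismatch");

      std::vector<char> result;
      result.reserve(result_size);

      while (p < end) {
          unsigned char op = *p++;
          if (op & 0x80) {                 // copy a range from the base
              uint64_t offset = 0, size = 0;
              for (int i = 0; i < 4; ++i)
                  if (op & (1 << i)) offset |= uint64_t(*p++) << (i * 8);
              for (int i = 0; i < 3; ++i)
                  if (op & (0x10 << i)) size |= uint64_t(*p++) << (i * 8);
              if (size == 0) size = 0x10000;
              result.insert(result.end(), base.data() + offset,
                            base.data() + offset + size);
          } else if (op) {                 // insert op literal bytes
              result.insert(result.end(), p, p + op);
              p += op;
          } else {
              throw std::runtime_error("invalid delta opcode 0");
          }
      }
      return result;
  }
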
Both cache types are optimized for different kinds of access modes, and
both are required to efficiently deal with everything git usually has to do.
Hence I changed my cache to support both modes, and reran the pack
verification test.
The result was better than expected: my implementation now takes the
lead by a tiny amount (25.3s vs. 26.0s) with a 16mb cache size. While
trying to make it even faster, I experimented with different cache
sizes, entry counts and, of course, different packs ranging from 20mb
to 600mb, which helped me fine-tune the relations between these
variables.
In the end, with a 27mb cache, my implementation took 20.6s, whereas
the git implementation only improved slightly, finishing after 25.3s.
I believe the cause of this is git's fixed number of entries. My cache
adjusts this number depending on the pack's size, the number of
objects, as well as the size of the cache. In this case, my cache
would have nearly 1000 entries, which helped spread the available
memory. Due to its limited number of entries, git does not even
benefit from further increasing the cache size, whereas I could get as
low as 13.5s by increasing the cache size to 48mb, for instance.
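The exact computation is in pack_file.cpp (linked at the end of this
mail), but the idea is roughly the following; the constants here are
purely illustrative and not the values I actually use:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>

  // Derive a suitable number of cache entries from the pack and the
  // configured cache memory. Illustrative constants only.
  size_t cache_entry_count(uint64_t pack_size, size_t num_objects,
                           size_t cache_memory)
  {
      if (num_objects == 0)
          return 0;
      // average compressed object size as a rough proxy for the size
      // of the entries we will end up storing
      uint64_t avg_size = std::max<uint64_t>(1, pack_size / num_objects);
      // aim for enough entries to actually spread the available memory
      uint64_t entries = cache_memory / avg_size;
      // keep the result within sane bounds
      return (size_t)std::min<uint64_t>(
          std::max<uint64_t>(entries, 64), 4096);
  }
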
Just for the fun of it, I increased the number of entries in the git
cache to the same number my cache was using, and suddenly git
performed equally well, finishing after just 20.8s with a 27mb cache
size.
As my random access cache performed worse in sequential access mode, I
ran a test to see whether the opposite is true as well: does the
sequential cache harm performance in random access mode? The answer
is: yes it does! To show some numbers: 34mb of objects per second
could be streamed without a cache, which was reduced to 28mb/s with a
sequential access cache. The cache in that case just causes overhead
(especially when reclaiming memory) and is only rarely hit.
To test my assumptions not only with my code, but also with git itself,
I used a test written for git-python, which streams blobs from the 27mb
git pack. With the default cache, I got 14mb/s. When I removed the
cache, that went up to 15mb/s, which was less than expected, but we
must not forget the git-python overhead here. Finally, with the
sequential access cache enabled, its entry count increased to 1000,
and its size raised to 27mb, I suddenly got 34.2mb/s! A new record,
for git-python at least ;).
As a final disclaimer, please let me emphasize that the tests I ran
are neither statistically sound, nor are the pack verification tests
necessarily comparable in all details. Additionally, the git-python
object throughput tests cannot be directly compared to the C++ test,
which has much less overhead. The tests were made to show performance
relations and to uncover ways to improve performance, not to claim
that one implementation is 'better' than the other.
-- Conclusions --
* The delta cache needs to deal with both random and sequential
access.
* The current implementation deals with sequential access only, which
is really only suitable for pack verification, and in fact hurts
performance in other cases if (at least) the number of entries is not
dynamically adjusted to the parameters of the actual pack.
* Random access caches work well with plenty of entries when storing
only uncompressed deltas and base objects, as reapplying a single
delta is very fast.
* Sequential access caches have to dynamically adjust their number of
entries according to the available cache memory and the average packed
object size, to make the best use of the available memory.
* It should be possible to adjust the caching mode at runtime, or to
disable the cache entirely.
* It might be useful or even necessary to have one cache per pack,
sharing a global memory limit, instead of one global cache, as each
cache needs to be adjusted to its pack.
In case anyone is interested in having a look at the way I determine
the cache parameters (which really are the key to optimizing
performance), this is the line to focus on:
https://github.com/Byron/gitplusplus/blob/deltastream/src/git/db/pack_file.cpp#L103
The cache is used by the pack stream, whose core is in the
unpack_object_recursive method (equivalent to unpack_delta_entry in
the git source):
https://github.com/Byron/gitplusplus/blob/deltastream/src/git/db/pack_stream.cpp#L247
Kind Regards,
Sebastian