Hi,
What follows is a summary of how I approached the git cache in order to
write my own improved version. The conclusions can be found further
down, in case you want to skip all the extra words for now.
I am currently working on a C++ implementation of the git core, which
for now includes reading and writing of loose objects, as well as
reading and verifying pack files. This is not the first time I have
done this, as I gained my first experience in this area with a pure
python implementation of the git core
(https://github.com/gitpython-developers/gitdb). This time, though, I
wanted to see whether I could achieve better performance, and how git
could be made more suitable for handling big files.
When profiling the initial, uncached version of my pack reading
implementation, I noticed that most of the time was actually spent in
zlib's inflate method. Apparently a cache was a good idea - git
already has one. Without looking at its implementation, I wrote my own
naive one right away, which stored only base objects and inflated
deltas. Interestingly, this alone could double the performance of my
test case, which simply streams the data of all objects contained in
the pack, sha by sha, resembling a random access pattern.
To compare my cache with git's, I implemented pack verification, which
basically generates the SHA-1 of every uncompressed object in the pack
and runs a crc32 over each compressed stream. As the objects are
streamed ordered by offset, the access pattern can be described as
sequential.
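Just to make concrete what the per-object check amounts to, here is a
minimal sketch (my own simplification, assuming zlib for the crc32 and
OpenSSL for the SHA-1; verify_object and the header handling are only
illustrative, neither git's nor my actual code):

  #include <openssl/sha.h>
  #include <zlib.h>
  #include <cstdint>
  #include <cstdio>
  #include <cstring>
  #include <string>
  #include <vector>

  // compressed: the deflated bytes exactly as stored in the pack
  // content:    the fully decompressed (and delta-resolved) object data
  // type:       the object type string, e.g. "blob", "tree", "commit"
  bool verify_object(const std::vector<unsigned char>& compressed,
                     const std::vector<unsigned char>& content,
                     const std::string& type,
                     const unsigned char expected_sha1[SHA_DIGEST_LENGTH],
                     uint32_t expected_crc32)
  {
      // crc32 over the compressed stream, as stored in the pack index
      uLong crc = crc32(0L, Z_NULL, 0);
      crc = crc32(crc, compressed.data(), (uInt)compressed.size());

      // sha1 over the canonical "<type> <size>\0" header and the content
      char header[64];
      int header_len = std::snprintf(header, sizeof(header), "%s %zu",
                                     type.c_str(), content.size()) + 1;
      SHA_CTX ctx;
      SHA1_Init(&ctx);
      SHA1_Update(&ctx, header, header_len);
      SHA1_Update(&ctx, content.data(), content.size());
      unsigned char sha1[SHA_DIGEST_LENGTH];
      SHA1_Final(sha1, &ctx);

      return crc == expected_crc32 &&
             std::memcmp(sha1, expected_sha1, SHA_DIGEST_LENGTH) == 0;
  }
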
It came as no surprise that git would verify my test pack (the
aggressively packed git source repository with 137k objects in a
single 27mb pack) considerably faster, by about 25%. After some
profiling and optimizations, I could narrow the gap to about 15%.
Considering that my cache, which only got faster in the course of
these optimizations, could speed up random access by a factor of about
2.5, it was hard to understand why I couldn't reach git's performance.
The major difference turned out to be the way the cache works. Git has a
small delta cache with only 256 [hardcoded] cache entries and a default
memory limit of 16mb. There it stores fully decompressed objects. It
maps objects to entries by hashing their pack offsets into the available
range of entries. When the pack is accessed sequentially, the cache will
be filled with related uncompressed objects, which can in turn reduce
the time required to apply the next delta by a huge amount, as only a
single delta has to be applied instead of a possibly long delta chain.
As git appears to pack deltas of related objects close to each other
(regarding their offset in the pack), the cache will be hit quite often
automatically. As the number of entries is small, and as the entries
are connected by a doubly linked list, reclaiming memory is rather
efficient: it evicts the least recently used objects first, which are
unlikely to ever be needed again. Eviction doesn't have to run very
often either, as most entries are simply overwritten with new data on
hash collisions.
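Stripped of the details, I understand the structure roughly like this
(my own naming and simplification, not git's actual code):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct CacheEntry {
      uint64_t offset = 0;       // pack offset of the cached object
      std::vector<char> data;    // fully decompressed object data
      bool valid = false;
  };

  // A fixed number of slots, indexed by hashing the pack offset,
  // storing whole decompressed objects under a global memory limit.
  class SequentialDeltaCache {
      static const size_t NUM_ENTRIES = 256;  // hardcoded, like in git
      size_t memory_limit;
      size_t memory_used = 0;
      std::vector<CacheEntry> entries;

      size_t slot(uint64_t offset) const {
          return (size_t)(offset >> 2) % NUM_ENTRIES;  // offset hash
      }

  public:
      explicit SequentialDeltaCache(size_t limit = 16 * 1024 * 1024)
          : memory_limit(limit), entries(NUM_ENTRIES) {}

      // returns the cached data, or nullptr on a miss
      const std::vector<char>* lookup(uint64_t offset) {
          CacheEntry& e = entries[slot(offset)];
          return (e.valid && e.offset == offset) ? &e.data : nullptr;
      }

      void store(uint64_t offset, std::vector<char> data) {
          CacheEntry& e = entries[slot(offset)];
          if (e.valid)
              memory_used -= e.data.size();  // collision: overwrite
          memory_used += data.size();
          e.offset = offset;
          e.data = std::move(data);
          e.valid = true;
          // git additionally threads the entries onto a doubly linked
          // list and evicts the least recently used ones while
          // memory_used > memory_limit
      }
  };
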
This cache implementation is clearly suitable for sequential access.
My cache was optimized for random access, hence it stores only base
objects and uncompressed delta streams, using many more entries to
achieve a good cache hit ratio. The sequential access cache, on the
other hand, gains its performance from storing full objects, which
fills the cache memory up much faster, so having a lot of cache
entries makes no sense there.
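The reason storing only inflated deltas is so cheap is that turning a
cached delta into the full object merely requires applying that single
delta to its base. That step boils down to the following (my own
sketch of the documented pack delta format, error handling kept to a
minimum):

  #include <cstdint>
  #include <stdexcept>
  #include <vector>

  // Reads one of the little-endian, 7-bits-per-byte variable length
  // sizes found at the start of a delta stream.
  static uint64_t read_size(const unsigned char*& p,
                            const unsigned char* end) {
      uint64_t value = 0;
      int shift = 0;
      unsigned char byte;
      do {
          if (p == end) throw std::runtime_error("truncated delta");
          byte = *p++;
          value |= uint64_t(byte & 0x7f) << shift;
          shift += 7;
      } while (byte & 0x80);
      return value;
  }

  std::vector<char> apply_delta(const std::vector<char>& base,
                                const std::vector<char>& delta) {
      const unsigned char* p =
          reinterpret_cast<const unsigned char*>(delta.data());
      const unsigned char* end = p + delta.size();

      uint64_t base_size = read_size(p, end);
      uint64_t result_size = read_size(p, end);
      if (base_size != base.size())
          throw std::runtime_error("base size mismatch");

      std::vector<char> result;
      result.reserve(result_size);

      while (p < end) {
          unsigned char op = *p++;
          if (op & 0x80) {                 // copy a range from the base
              uint64_t offset = 0, size = 0;
              for (int i = 0; i < 4; ++i)
                  if (op & (1 << i)) offset |= uint64_t(*p++) << (i * 8);
              for (int i = 0; i < 3; ++i)
                  if (op & (0x10 << i)) size |= uint64_t(*p++) << (i * 8);
              if (size == 0) size = 0x10000;
              result.insert(result.end(), base.data() + offset,
                            base.data() + offset + size);
          } else if (op) {                 // insert op literal bytes
              result.insert(result.end(), p, p + op);
              p += op;
          } else {
              throw std::runtime_error("invalid delta opcode 0");
          }
      }
      return result;
  }
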
Both cache types are optimized for different kinds of access modes, and
both are required to efficiently deal with everything git usually has to do.
Hence I changed my cache to support both modes, and reran the pack
verification test.
The result was better than expected: my implementation now takes the
lead by a tiny amount (25.3s vs. 26.0s) with a 16mb cache size. While
trying to make it even faster, I experimented with different cache
sizes, entry counts and, of course, different packs ranging from 20mb
to 600mb, which helped me fine-tune the relations between these
variables.
In the end, with a 27mb cache, my implementation took 20.6s, whereas
the git implementation only improved slightly, finishing after 25.3s.
I believe the cause of this is git's fixed number of entries. My cache
adjusts this number depending on the pack's size, the number of
objects, as well as the size of the cache. In this case, my cache
would have nearly 1000 entries, which helped spread the available
memory. Due to its limited number of entries, git does not even
benefit from further increasing the cache size, whereas I could get as
low as 13.5s by increasing the cache size to 48mb, for instance.
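The exact computation is in pack_file.cpp (linked at the end of this
mail), but the idea is roughly the following; the constants here are
purely illustrative and not the values I actually use:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>

  // Derive a suitable number of cache entries from the pack and the
  // configured cache memory. Illustrative constants only.
  size_t cache_entry_count(uint64_t pack_size, size_t num_objects,
                           size_t cache_memory)
  {
      if (num_objects == 0)
          return 0;
      // average compressed object size as a rough proxy for the size
      // of the entries we will end up storing
      uint64_t avg_size = std::max<uint64_t>(1, pack_size / num_objects);
      // aim for enough entries to actually spread the available memory
      uint64_t entries = cache_memory / avg_size;
      // keep the result within sane bounds
      return (size_t)std::min<uint64_t>(
          std::max<uint64_t>(entries, 64), 4096);
  }
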
Just for the fun of it, I increased the number of entries in the git
cache to the same number my cache was using, and suddenly git
performed equally well, finishing after just 20.8s with a 27mb cache
size.
As my random access cache performed worse in sequential access mode, I
ran a test to see whether the opposite is true as well: does the
sequential cache harm performance in random access mode? The answer
is: yes it does! To show some numbers: 34mb of objects per second
could be streamed without a cache, which was reduced to 28mb/s with a
sequential access cache. The cache in that case just causes overhead
(especially when reclaiming memory) and is only rarely hit.
To test my assumptions not only with my code, but also with git itself,
I used a test written for git-python, which streams blobs from the 27mb
git pack. With the default cache, I got 14mb/s. When I removed the
cache, that went up to 15mb/s, which was less than expected, but we
must not forget the git-python overhead here. Finally, with the
sequential access cache enabled, its entry count increased to 1000,
and its size raised to 27mb, I suddenly got 34.2mb/s! A new record,
for git-python at least ;).
As a final disclaimer, please let me emphasize that the tests I ran
are neither statistically sound, nor are the pack verification tests
necessarily comparable in all details. Additionally, the git-python
object throughput tests cannot be directly compared to the C++ test,
which has much less overhead. The tests were made to show performance
relations and to uncover ways to improve performance, not to claim
that one implementation is 'better' than the other.
-- Conclusions --
* The delta cache needs to deal with both random and sequential
access.
* The current implementation deals with sequential access only, which
is really only suitable for pack verification, and in fact hurts
performance in other cases if (at least) the number of entries is not
dynamically adjusted to the parameters of the actual pack.
* Random access caches work well with plenty of entries when storing
only uncompressed deltas and base objects, as reapplying a single
delta is very fast.
* Sequential access caches have to dynamically adjust their number of
entries according to the available cache memory and the average packed
object size, to make the best use of the available memory.
* It should be possible to adjust the caching mode at runtime, or to
disable the cache entirely.
* It might be useful or even necessary to have one cache per pack,
sharing a global memory limit, instead of one global cache, as each
cache needs to be adjusted to its pack.
In case anyone is interested in having a look at the way I determine
the cache parameters (which really are the key to optimizing
performance), this is the line to focus on:
https://github.com/Byron/gitplusplus/blob/deltastream/src/git/db/pack_file.cpp#L103
The cache is used by the pack stream, whose core is in the
unpack_object_recursive method (equivalent to unpack_delta_entry in
the git source):
https://github.com/Byron/gitplusplus/blob/deltastream/src/git/db/pack_stream.cpp#L247
Kind Regards,
Sebastian