Re: [RFC] Cache negative delta pairs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jeff King <peff@xxxxxxxx> writes:

> From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive.
>...
> I found this especially to be a problem with repos that consist of many
> large, unrelated files (e.g., photos). For example, on my test repo
> (about 300 unrelated 1-2M jpgs), a 'git-repack -a' takes about 10
> minutes to complete. With the delta cache, subsequent repacks take only
> 13 seconds. Results are not quite as dramatic for "normal" repos, but
> there is still some speedup. Repacking a fully packed linux-2.6 repo
> went from 1m12s to 36s. Repacking the git repo goes from 5.6s to 3.0s.

Interesting idea.  I think this matters more because for a
repository with many unrelated undeltifiable files, we do the
computation for objects that results in _no_ delta.  For normal
nearly fully packed repositories, once an object is deltified
against something else, subsequent repacking of the same set of
objects (or a superset thereof) will very likely reuse the delta
without recomputation, so as long as each object _can_ be
deltified with at least _one_ other object, you should not see
improvement on them.

So I am curious where the speed-up comes from for "normal" repos
in your experiments.  If it turns out that in "normal" repos the
objects that hit your negative cache are stored undeltified,
then that suggests that it might be worthwhile to consider using
a cache of "inherently undeltifiable objects", In other words, a
negative cache of O(N) entries, instead of O(N^2) entries,

Another interpretation of your result is that we may be using a
delta window that is unnecessarily too deep, and your negative
cache is collecting less optimum candidates that we attempt to
deltify against "just in case".  Can you easily instrument your
code to see where in the sorted delta candidate list the pairs
that hit your the negative cache are?  That is, in find_deltas()
function, we have "while (--j > 0)" loop that attempts to delta
with the entry that is j (modulo window size) entries away from
the current one, then j-1, j-2, ...; I am interested in the
distribution of "j" value for the pair "n,m" that hits your
negative cache for normal repositories, and I am speculating
that the value would probably be small relative to the delta
window size.

Another idea is to have a cache of "paths at which inherently
undeltifiable objects live in".  For example, we currently do
not delta OpenOffice documents (*.odt, *.odp, etc) very well.
If one has a repository that tracks the history of "file.odp",
we know each revision of "file.odp" would not delta against any
other version anyway, and could skip attempting to deltify them.

Your message contained string "*pt-in" in the commentary part
(replace asterisk with lowercase o) and was discarded by vger
mailing list software because that was a taboo word.  If you
would want to pursue this I would suggest to resend your
original patch after rephrasing that part.

>  - size. The cache is a packed sequence of binary sha1 pairs. I was
>    concerned that it would grow too large (obviously for n blobs you can
>    end up with n^2/2 entries), but it doesn't seem unreasonable for most
>    repos (either you don't have a lot of files, or if you do, they delta
>    reasonably well). My test repo's cache is only 144K. The git cache is
>    about 2.7M. The linux-2.6 cache is 22M.

The fully-packed object database is 6.2M pack so you are talking
about 40% bloat; the kernel is 115M so the overhead is 19%.

-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]