Re: [PATCH] diffcore-rename: cache file deltas

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jeff King <peff@xxxxxxxx> writes:

> We find rename candidates by computing a fingerprint hash of
> each file, and then comparing those fingerprints. There are
> inherently O(n^2) comparisons, so it pays in CPU time to
> hoist the (rather expensive) computation of the fingerprint
> out of that loop (or to cache it once we have computed it once).
>
> Previously, we didn't keep the filespec information around
> because then we had the potential to consume a great deal of
> memory. However, instead of keeping all of the filespec
> data, we can instead just keep the fingerprint.
>
> This patch implements and uses diff_free_filespec_data_large
> to accomplish that goal. We also have to change
> estimate_similarity not to needlessly repopulate the
> filespec data when we already have the hash.
>
> Practical tests showed 4.5x speedup for a 10% memory usage
> increase.
>
> Signed-off-by: Jeff King <peff@xxxxxxxx>

Very nice.

> The implementation is a little less nice than I would like, but I was
> trying to be non-invasive. Specifically:
>
>  - the name diff_free_filespec_data_large is horrible, but this is based
>    on the fact that diff_free_filespec_data actually does too much (it
>    frees the data _and_ some other auxiliary data). And renaming that
>    would entail changing many callsites.

True.  But we can rename it to diff_file_filespec_blob() and
that would perfectly well describe what it does.  Will do so
when applying if it is Ok to you.

>  - It seems that a better place to call diffcore_populate_filespec
>    (rather than in estimate_similarity) would actually be in
>    diffcore_count_changes (when we _know_ that we need to populate the
>    contents data).
>
>  - The hash_chars() should arguably be tied into
>    diffcore_populate_filespec, which should have more of a "what
>    information do you want" interface. I.e., the "size_only" parameter
>    could be extended to a bitfield to say "populate this, and I need the
>    delta fingerprint, size, actual contents, etc". Then callers could
>    just use "populate" before looking at the filespec and it would
>    lazily load whatever they needed.

Both good points, but I agree with you that it is wise to do
that with a follow-up patch.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux