Re: [PATCH] Adding a cache of commit to patch-id pairs to speed up git-cherry

"Geoffrey Irving" <irving@xxxxxxx> · Mon, 2 Jun 2008 07:50:52 -0700

On Sun, Jun 1, 2008 at 11:13 PM, Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
> Hi,
>
> On Sun, 1 Jun 2008, Geoffrey Irving wrote:
>
>> The dominant cost of git-cherry is the computation of patch-ids for each
>> relevant commit.  Once computed, these pairs are now stored in a hash
>> table in $GIT_DIR/patch-id-cache to speed up repeated invocations.
>>
>> The basic structure of patch-id-cache.c was cannibalized from Johannes
>> Schindelin's notes-index structure, though most of the code was
>> rewritten. The hash table is still kept strictly sorted by commit, but
>> the entire table is now read into memory.
>
> I do not think that this "read-the-entire-table-into-memory" paradigm is a
> wise choice. mmap()ing, I would have understood, but reading a potentially
> pretty large table into memory?

I'll switch to mmapping.

>> See commit d56dbe8cb56ce9b4697d1f1c2425e2e133a932a5 for the original
>> code.
>
> This is not in any official git.git repository.  And my branches are prone
> to be rebased.  So this is not a good reference.  The mailing list,
> however, would be one.

That was added only because I knew it wouldn't be the final patch, and
that it'd be useful to you.

>> diff --git a/Makefile b/Makefile
>> index cce5a6e..3a5396d 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -435,6 +435,7 @@ LIB_OBJS += pager.o
>>  LIB_OBJS += parse-options.o
>>  LIB_OBJS += patch-delta.o
>>  LIB_OBJS += patch-ids.o
>> +LIB_OBJS += patch-id-cache.o
>
> If all you do is a hashmap from SHA-1 to SHA-1, then I think
> "patch-id-cache" is a misnomer for that file and structure.
>
> Let's not repeat the same naming mistakes as we have for path_list and
> decoration.

Is persistent_sha1_map a good name?

>> +static void read_patch_id_cache()
>> +{
>> +     int fd = open(git_path("patch-id-cache"), O_RDONLY);
>> +
>> +     if (fd < 0) { // allocate a new cache
>> +             size_t size = 64;
>> +             cache = xcalloc(get_hash_size(size), 1);
>> +             cache->key_size = cache->actual_size = size;
>> +             return;
>> +     }
>> +
>> +     struct patch_id_cache header;
>> +     size_t len = read_in_full(fd, &header, sizeof(header));
>> +     if (len != sizeof(header))
>> +             die("cannot read patch-id-cache header");
>> +     if (header.key_size & (header.key_size-1))
>> +             die("patch-id-cache key size %lld is not a power of two", (long
>> long)header.key_size);
>
> Very long line.
>
> Why does the key_size have to be a power of two?

It doesn't, but the current code creates it that way, so that was a
sanity check.  Well, actually the current code uses & instead of %, so
it does require it unless I switch back to %.

>> +     free(cache);
>> +     cache = xmalloc(get_hash_size(header.actual_size));
>> +     *cache = header;
>
> This free() kind of defies the xmalloc in the if() clause above.
>
> Besides, those three lines are a pretty awkward way to avoid writing
> xrealloc(), don't you think?
>
>> +     size_t entry_size = header.actual_size * sizeof(struct patch_id_entry);
>
> Declaration after statement.
>
>> +     len = read_in_full(fd, cache->entries, entry_size);
>> +     if (len != entry_size)
>> +             die("cannot read patch-id-cache entries");
>> +     written_count = cache->count;
>
> Huh?  It is not written_count.  Besides, why do you use a global here?
> (Yes, I know, you probably just copied it from my code, but you could
> clean it up in the process ;-)
>
>> +     // It's impossible to check whether the cache is actually accurate,
>
> C++ style comments.
>
>> +void write_patch_id_cache()
>> +{
>> +     if (!cache || cache->count == written_count)
>> +             return;
>
> Does that mean that the patch_id_cache is not updated when the number
> of commits stays the same after committing, branch -d and gc?

The patch_id_cache is only updated when it changes.  Since entries are
immutable and are never deleted, all changes increase the count.

>> +     struct lock_file update_lock;
>> +     int fd = hold_lock_file_for_update(&update_lock,
>> +                     git_path("patch-id-cache"), 0);
>> +
>> +     if (fd < 0) {
>> +             error("could not construct patch-id-cache");
>> +             return;
>> +     }
>
> Not good.  Printing an error, but not returning any indicator of it.
>
>> +int get_cached_commit_patch_id(const unsigned char *commit_sha1,
>> +             unsigned char *patch_id_sha1)
>
> Again, I do not think it is wise to name this function so focused on this
> particular purpose, when its use is clearly not limited to patch ids.
>
>> +void cache_commit_patch_id(const unsigned char *commit_sha1,
>> +             const unsigned char *patch_id_sha1)
>> +{
>> +     if (4*cache->count > 3*cache->key_size) { // resize
>
> C++ comment.
>
>> +             struct patch_id_cache *old_cache = cache;
>> +             size_t new_key_size = 2*cache->key_size,
>> +                     new_actual_size = cache->key_size + cache->actual_size;
>> +             cache = xcalloc(get_hash_size(new_actual_size), 1);
>> +             cache->key_size = new_key_size;
>> +             cache->actual_size = new_actual_size;
>> +
>> +             // reinsert all entries
>> +             size_t i;
>> +             for (i = 0; i < old_cache->actual_size; i++)
>> +                     cache_commit_patch_id(old_cache->entries[i].commit_sha1,
>> +                             old_cache->entries[i].patch_id_sha1);
>> +     }
>
> Does this not just _beg_ to live in an own function ("grow_hash")?
>
>> +     struct patch_id_entry entry;
>
> Declaration after statement.
>
>> +     hashcpy(entry.commit_sha1, commit_sha1);
>> +     hashcpy(entry.patch_id_sha1, patch_id_sha1);
>
> It would be more elegant to copy the SHA-1s _after_ finding where to write
> them.

Alas, that would break the elegance of my loop, since the loop swaps
in new entries to keep the table sorted.  I can remove the sorting if
you want: I only left it in there to be more similar to your code.

>> +     size_t i = get_hash_index(commit_sha1);
>
> Declaration after statement.
>
>> +     for (;;) {
>> +             if (i == cache->actual_size) {
>> +                     cache->actual_size++; // adding one suffices for expected O(1)
>
> Long line.  C++ comment.  "one suffices"?
>
>> +                     cache = xrealloc(cache, get_hash_size(cache->actual_size));
>> +                     cache->entries[cache->actual_size-1] = entry;
>> +                     return;
>> +             }
>> +             int cmp = hashcmp(entry.commit_sha1, cache->entries[i].commit_sha1);
>
> Declar... ah, well, suffice to say that you should read the
> CodingGuidelines, and try to fix up all the offending sites in your code.

I'd be happy to do that, but I don't see mention of either C++
comments or declarations after statements in the CodingGuidelines.

>> +             if (!cmp) {
>> +                     if (hashcmp(entry.patch_id_sha1, cache->entries[i].patch_id_sha1))
>> +                             die("found mismatched entry in patch-id-cache");
>
> I wonder if that potentially expensive operation should not rather be
> wrapped in an assert() call (since I recently learnt that Git's source
> code has more than one instance of assert()).

A 20 byte comparison doesn't seem potentially expensive to me.

>> +                     return;
>> +             } else if (cmp < 0) {
>> +                     i++;
>> +             } else if (is_null_sha1(cache->entries[i].commit_sha1)) {
>> +                     cache->entries[i] = entry;
>> +                     cache->count++;
>> +                     return;
>> +             } else {
>> +                     struct patch_id_entry tmp = cache->entries[i];
>> +                     cache->entries[i] = entry;
>> +                     entry = tmp;
>> +                     i++;
>> +             }
>
> You seem to be wanting to implement a "sorted hashmap"... this is
> non-standard, and AFAICT affects performance negatively.

I'll strip it out.  I actually stripped it out initially, and then
added it back to mirror your code.

>> diff --git a/patch-ids.c b/patch-ids.c
>> index 3be5d31..91fabeb 100644
>> --- a/patch-ids.c
>> +++ b/patch-ids.c
>> @@ -3,16 +3,30 @@
>>  #include "commit.h"
>>  #include "patch-ids.h"
>>
>> +extern void write_patch_id_cache();
>> +extern int get_cached_commit_patch_id(const unsigned char
>> *commit_sha1, unsigned char *patch_id_sha1);
>> +extern void cache_commit_patch_id(const unsigned char *commit_sha1,
>> const unsigned char *patch_id_sha1);
>> +
>
> Uh oh.  This is not good.  If you do not have declarations like these in
> header files (that are included from the implementing .c file, of course),
> it is all too easy to have the signatures of the implementation and the
> declaration diverge.
>
> So move them into an own header file, where they belong.  IIRC I did this
> in the notes code, so why did you choose not to imitate that?

No good reason.

>>  static uint32_t take2(const unsigned char *id)
>> @@ -136,6 +150,8 @@ int free_patch_ids(struct patch_ids *ids)
>>               next = patches->next;
>>               free(patches);
>>       }
>> +
>> +     write_patch_id_cache();
>>       return 0;
>
> That's cute.

Thanks (assuming cute means good).

I'll wait to actually make any of these changes until you and Jeff
decide whether I should stick to hashes or use binary search.  It
seems sad not to use hashes for a map when we get the best hash keys
in the world for free, so I'd prefer not switching.

Geoffrey
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html