Re: [PATCH] Adding a cache of commit to patch-id pairs to speed up git-cherry

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Mon, 2 Jun 2008 16:52:32 +0100 (BST)

Hi,

On Mon, 2 Jun 2008, Geoffrey Irving wrote:

> On Sun, Jun 1, 2008 at 11:13 PM, Johannes Schindelin
> <Johannes.Schindelin@xxxxxx> wrote:
>
> > On Sun, 1 Jun 2008, Geoffrey Irving wrote:
> >
> >> See commit d56dbe8cb56ce9b4697d1f1c2425e2e133a932a5 for the original 
> >> code.
> >
> > This is not in any official git.git repository.  And my branches are 
> > prone to be rebased.  So this is not a good reference.  The mailing 
> > list, however, would be one.
>
> That was added only because I knew it wouldn't be the final patch, and 
> that it'd be useful to you.

Ah, then you might want to say "[PATCH/RFC]" in the subject.  I thought 
you meant this for inclusion.

> >> diff --git a/Makefile b/Makefile
> >> index cce5a6e..3a5396d 100644
> >> --- a/Makefile
> >> +++ b/Makefile
> >> @@ -435,6 +435,7 @@ LIB_OBJS += pager.o
> >>  LIB_OBJS += parse-options.o
> >>  LIB_OBJS += patch-delta.o
> >>  LIB_OBJS += patch-ids.o
> >> +LIB_OBJS += patch-id-cache.o
> >
> > If all you do is a hashmap from SHA-1 to SHA-1, then I think
> > "patch-id-cache" is a misnomer for that file and structure.
> >
> > Let's not repeat the same naming mistakes as we have for path_list and
> > decoration.
> 
> Is persistent_sha1_map a good name?

I'd prefer cached_sha1_map, but I do not really care.

> >> +void write_patch_id_cache()
> >> +{
> >> +     if (!cache || cache->count == written_count)
> >> +             return;
> >
> > Does that mean that the patch_id_cache is not updated when the number 
> > of commits stays the same after committing, branch -d and gc?
> 
> The patch_id_cache is only updated when it changes.  Since entries are 
> immutable and are never deleted, all changes increase the count.

Right, I only realized that with my previous post to this thread.  
However, in that case I would prefer a flag "int dirty:1" to the cache 
structure, which is set whenever it needs writing.  Certainly not a global 
variable (which the cache already is).

> >> +     hashcpy(entry.commit_sha1, commit_sha1);
> >> +     hashcpy(entry.patch_id_sha1, patch_id_sha1);
> >
> > It would be more elegant to copy the SHA-1s _after_ finding where to 
> > write them.
> 
> Alas, that would break the elegance of my loop, since the loop swaps in 
> new entries to keep the table sorted.  I can remove the sorting if you 
> want: I only left it in there to be more similar to your code.

I think that my code tried to do too much here; it was before I benched 
hashmap vs binary-search.

So I think that can just go and simplify the elegance of the loop even 
more ;-)

> > Declar... ah, well, suffice to say that you should read the 
> > CodingGuidelines, and try to fix up all the offending sites in your 
> > code.
> 
> I'd be happy to do that, but I don't see mention of either C++
> comments or declarations after statements in the CodingGuidelines.

>From the Coding Guidelines:

	As for more concrete guidelines, just imitate the existing code
	(this is a good guideline, no matter which project you are
	contributing to).

> 
> >> +             if (!cmp) {
> >> +                     if (hashcmp(entry.patch_id_sha1, cache->entries[i].patch_id_sha1))
> >> +                             die("found mismatched entry in patch-id-cache");
> >
> > I wonder if that potentially expensive operation should not rather be 
> > wrapped in an assert() call (since I recently learnt that Git's source 
> > code has more than one instance of assert()).
> 
> A 20 byte comparison doesn't seem potentially expensive to me.

It's all a question of repetition, isn't it?

In any case, I am not a fan of wasting cycles unnecessarily.  The check 
would indicate a programming error, not a user error, and should therefore 
punish the programmer, not the user.

> >>  static uint32_t take2(const unsigned char *id)
> >> @@ -136,6 +150,8 @@ int free_patch_ids(struct patch_ids *ids)
> >>               next = patches->next;
> >>               free(patches);
> >>       }
> >> +
> >> +     write_patch_id_cache();
> >>       return 0;
> >
> > That's cute.
> 
> Thanks (assuming cute means good).

Yes, it does!

> I'll wait to actually make any of these changes until you and Jeff 
> decide whether I should stick to hashes or use binary search.  It seems 
> sad not to use hashes for a map when we get the best hash keys in the 
> world for free, so I'd prefer not switching.

Me, too.

BTW I think that you did some awesome work here, and I am glad that you 
could use my code as starting point.

Thanks,
Dscho

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html