Re: [PATCH 1/2] log: grep author/committer using mailmap

Antoine Pelisse <apelisse@xxxxxxxxx> writes:

>>>
>>> +static int commit_rewrite_authors(struct strbuf *buf, const char *what, struct string_list *mailmap)
>>> +{
>>> +     char *author, *endp;
>>> +     size_t len;
>>> +     struct strbuf name = STRBUF_INIT;
>>> +     struct strbuf mail = STRBUF_INIT;
>>> +     struct ident_split ident;
>>> +
>>> +     author = strstr(buf->buf, what);
>>> +     if (!author)
>>> +             goto error;
>>
>> This does not stop at the end of the header part and would match a
>> random line in the log message that happens to begin with "author ";
>> is this something we would worry about, or would we leave it to "fsck"?
>
> The only worrying case would be:
> ...

Yeah, that pretty much matches what I had in mind (the short answer:
leave it to "git fsck").
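
For reference, a header-only lookup could be sketched as below.  This
is only an illustration of the concern, not what the patch does;
find_header_line() is a made-up name, and the sketch assumes the commit
buffer is NUL-terminated with the header separated from the log message
by the first blank line.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): find the header line that
 * begins with `what` (e.g. "author "), looking only at the header part
 * of a commit object, i.e. everything before the first blank line.
 * Returns a pointer to the start of that line, or NULL if absent.
 */
static const char *find_header_line(const char *buf, const char *what)
{
	size_t what_len;
	const char *line = buf;

	assert(buf && what);
	what_len = strlen(what);

	while (*line && *line != '\n') {	/* an empty line ends the header */
		const char *eol = strchr(line, '\n');

		if (!strncmp(line, what, what_len))
			return line;
		if (!eol)
			break;
		line = eol + 1;
	}
	return NULL;
}
```

Unlike a plain strstr(), this never looks past the blank line, so a log
message line that happens to begin with "author " is not matched.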

>> We usually signal error by returning a negative integer.  It does
>> not matter too much in this case as no callers seem to check the
>> return value from this function, though.
>
> Fixed, or would you rather see it `void` ?

Just as you can optimize this function by taking advantage of
map_user() telling its caller whether it did anything, in the longer
run it may help the (future) callers of this function if it reported
"I did something" vs "I left it intact".  In the particular case of
this function, the "error" cases fall into the latter (it merely
explains why it left it intact, and there is no sensible error
recovery the caller _could_ do in any case) and I think it is not
necessary to differentiate between "Returned as-is because there is
no mapping" and "Returned as-is because I couldn't parse the
commit".

So "return 0 when it didn't do anything, return 1 when it rewrote"
feels good enough, at least to me.
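
A minimal, standalone sketch of that convention follows.  It does not
use git's strbuf or map_user(); lookup_canonical() and rewrite_ident()
are hypothetical names, and the single hard-coded mapping stands in for
a real mailmap.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for a mailmap lookup (not git's map_user()):
 * returns the canonical email for `mail`, or NULL when nothing maps.
 */
static const char *lookup_canonical(const char *mail, size_t len)
{
	static const char old[] = "old@example.com";

	if (len == strlen(old) && !memcmp(mail, old, len))
		return "new@example.com";
	return NULL;
}

/*
 * Rewrite the email in an ident such as "A U Thor <old@example.com>".
 * Returns 1 when `out` holds a rewritten ident, and 0 when the ident
 * was left as-is, whether because it could not be parsed or because
 * no mapping applied.  `out` is meaningful only when 1 is returned.
 */
static int rewrite_ident(const char *ident, char *out, size_t outsz)
{
	const char *lt, *gt, *mapped;
	int n;

	assert(ident && out);
	lt = strchr(ident, '<');
	gt = lt ? strchr(lt, '>') : NULL;
	if (!gt)
		return 0;	/* unparsable: left intact */
	mapped = lookup_canonical(lt + 1, (size_t)(gt - lt - 1));
	if (!mapped)
		return 0;	/* no mapping: left intact */
	n = snprintf(out, outsz, "%.*s<%s>", (int)(lt - ident), ident, mapped);
	return n >= 0 && (size_t)n < outsz;
}
```

A caller then only needs "if (rewrite_ident(...))" to decide whether to
use the rewritten ident; both "as-is" cases look the same to it.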

>>> +     }
>>> +
>>> +     strbuf_add(&name, ident.name_begin, ident.name_end - ident.name_begin);
>>> +     strbuf_add(&mail, ident.mail_begin, ident.mail_end - ident.mail_begin);
>>> +
>>> +     map_user(mailmap, &mail, &name);
>>> +
>>> +     strbuf_addf(&name, " <%s>", mail.buf);
>>> +
>>> +     strbuf_splice(buf, ident.name_begin - buf->buf,
>>> +                   ident.mail_end - ident.name_begin + 1,
>>> +                   name.buf, name.len);
>>
>> Would it give us better performance if we splice only when
>> map_user() tells us that we actually rewrote the ident?
>
> My intuition was that the cost of the splice comes from the memmove()
> when the size is different. Still, fixed, as it removes two copies.

Thanks.

I wonder if we can further restructure the code so that it first
inspects the existing buffer to see if it even needs to copy the
original commit buffer into a "strbuf only for grepping".  If that
can be easily done, then we will save even more copying, I think.
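
One way this "inspect first, copy only if needed" idea might look,
outside of git's own data structures, is sketched below.  Both function
names are hypothetical, the mapping check is a hard-coded stand-in for
a mailmap lookup, and the sketch returns NULL on allocation failure
where git itself would more likely die.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical predicate standing in for a mailmap lookup: does the
 * header line [line, line + len) carry an ident that would be mapped?
 * Here only "<old@example.com>" is considered mapped.
 */
static int ident_needs_mapping(const char *line, size_t len)
{
	static const char needle[] = "<old@example.com>";
	const char *p = strstr(line, needle);

	return p && p + strlen(needle) <= line + len;
}

/*
 * Return the buffer to run grep on.  If neither the "author " nor the
 * "committer " header needs mapping, the original buffer is returned
 * and *to_free is set to NULL, so nothing is copied.  Otherwise a
 * private copy (to be rewritten by the caller) is returned, and the
 * caller must free(*to_free).  Returns NULL if allocation fails.
 */
static const char *buffer_for_grep(const char *commit, char **to_free)
{
	const char *line = commit;

	assert(commit && to_free);
	*to_free = NULL;
	while (*line && *line != '\n') {	/* stop at end of header */
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if ((!strncmp(line, "author ", 7) ||
		     !strncmp(line, "committer ", 10)) &&
		    ident_needs_mapping(line, len)) {
			size_t total = strlen(commit) + 1;

			*to_free = malloc(total);
			if (!*to_free)
				return NULL;
			memcpy(*to_free, commit, total);
			return *to_free;
		}
		if (!eol)
			break;
		line = eol + 1;
	}
	return commit;
}
```

When mapped idents are rare, most commits take the "return commit"
path and no copy is made at all.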

The reason I alluded to revamping the grep API to get rid of the use
of "header grep" mode in this codepath was exactly that.  We could:

 - change the command line parser for --author= and --committer= so
   that these do not become part of the main "grep" expression.
   Instead we keep them as separate grep expressions (one "author"
   expression that OR'es the --author= options together, the other
   for the --committer= options);

 - in this codepath, inspect the "author" and "committer" in the
   commit object buffer, map them if necessary via the mailmap
   mechanism into temporary buffers (that is different from the
   "buf" in the commit_match() function), then run grep_buffer()
   with the author and committer grep expressions we separated in
   the previous step. Then we combine the results from "author" and
   "committer" grep and the main grep_buffer() result ourselves in
   this function.
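
The shape of the two steps above could be sketched roughly as follows.
Everything here is an assumption made for illustration: struct
log_filters, match() and commit_matches() are hypothetical names, a
substring test stands in for a real grep expression, each field holds
a single pattern instead of an OR of several, and the three results
are simply ANDed, which may not be the combination rule we would
actually want.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container; a NULL pattern means "option not given". */
struct log_filters {
	const char *author;	/* from --author= */
	const char *committer;	/* from --committer= */
	const char *message;	/* from --grep= */
};

/* Stand-in for running one grep expression: plain substring match. */
static int match(const char *pattern, const char *text)
{
	return !pattern || strstr(text, pattern) != NULL;
}

/*
 * `author` and `committer` are the idents after any mailmap rewriting,
 * held in temporary buffers; `body` is the untouched commit message.
 * The commit object buffer itself is never copied or modified.
 */
static int commit_matches(const struct log_filters *f,
			  const char *author, const char *committer,
			  const char *body)
{
	assert(f && author && committer && body);
	return match(f->author, author) &&
	       match(f->committer, committer) &&
	       match(f->message, body);
}
```

Only the two short ident strings ever need a temporary buffer; the
message part is matched in place.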

That may essentially amount to going in the totally opposite
direction from what 2d10c55 (git log: Unify header_filter and
message_filter into one., 2006-09-20) attempted to do.  We used to
have two grep expressions (one for the header, the other for the
body); commit_match() ran grep_buffer() on each and combined the
results.
2d10c55 merged them into one grep expression by introducing a term
that matches only header elements.  But we would instead split the
"header" expression into "author" and "committer" expressions
(making it three from one) if we go the above route.

That would eliminate the need to copy and rewrite the contents of
the commit object in this codepath, which may be a big win when
names and emails that need to be rewritten are minority cases.

But I suspect that is a much larger change.  If we can reduce the
amount of copies necessary without changing the code structure, that
may be enough to reduce the performance hit from this change.

Thanks.