Antoine Pelisse <apelisse@xxxxxxxxx> writes:

>>>
>>> +static int commit_rewrite_authors(struct strbuf *buf, const char *what, struct string_list *mailmap)
>>> +{
>>> +	char *author, *endp;
>>> +	size_t len;
>>> +	struct strbuf name = STRBUF_INIT;
>>> +	struct strbuf mail = STRBUF_INIT;
>>> +	struct ident_split ident;
>>> +
>>> +	author = strstr(buf->buf, what);
>>> +	if (!author)
>>> +		goto error;
>>
>> This does not stop at the end of the header part and would match a
>> random line in the log message that happens to begin with "author ";
>> is this something we would worry about, or would we leave it to "fsck"?
>
> The only worrying case would be:
> ...

Yeah, that pretty much matches what I had in mind (the short answer:
leave it to "git fsck").

>> We usually signal error by returning a negative integer.  It does
>> not matter too much in this case as no callers seem to check the
>> return value from this function, though.
>
> Fixed, or would you rather see it `void` ?

Just like you can take advantage of map_user() that signals the
caller if it did anything to optimize this function, in the longer
run, it may help the (future) callers of this function if it gave
"I did something" vs "I left it intact".

In the particular case of this function, the "error" cases fall into
the latter (it merely explains why it left it intact, and there is
no sensible error recovery the caller _could_ do in any case) and I
think it is not necessary to differentiate between "Returned as-is
because there is no mapping" and "Returned as-is because I couldn't
parse the commit".

So "return 0 when it didn't do anything, return 1 when it rewrote"
feels good enough, at least to me.

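To illustrate the "0 when left intact, 1 when rewritten" convention,
here is a rough standalone sketch.  It is not git code: toy_map and
toy_rewrite_ident() are names made up for this illustration, a plain
char buffer stands in for the strbuf, and a linear lookup stands in
for map_user().  Like the patch under discussion, it does not stop at
the end of the header.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a mailmap entry; not git's real structure. */
struct toy_map {
	const char *old_mail;
	const char *new_name;
	const char *new_mail;
};

/*
 * Rewrite the "Name <mail>" that follows the first occurrence of `what`
 * in buf.  Returns 1 when buf was rewritten and 0 when it was left
 * intact, whatever the reason (no header, unparsable ident, no mapping,
 * or not enough room in buf).
 */
static int toy_rewrite_ident(char *buf, size_t bufsz, const char *what,
			     const struct toy_map *map, size_t nr)
{
	char ident[256];
	char *name_begin, *lt, *gt;
	size_t i, mail_len, tail_len;
	int new_len;

	assert(buf && what);

	name_begin = strstr(buf, what);
	if (!name_begin)
		return 0;
	name_begin += strlen(what);
	lt = strchr(name_begin, '<');
	gt = lt ? strchr(lt, '>') : NULL;
	if (!gt)
		return 0;
	mail_len = (size_t)(gt - (lt + 1));

	for (i = 0; i < nr; i++)
		if (strlen(map[i].old_mail) == mail_len &&
		    !strncmp(map[i].old_mail, lt + 1, mail_len))
			break;
	if (i == nr)
		return 0;	/* nothing to map, so no copy and no splice */

	new_len = snprintf(ident, sizeof(ident), "%s <%s>",
			   map[i].new_name, map[i].new_mail);
	if (new_len < 0 || (size_t)new_len >= sizeof(ident))
		return 0;
	tail_len = strlen(gt + 1) + 1;	/* rest of the buffer plus NUL */
	if ((size_t)(name_begin - buf) + (size_t)new_len + tail_len > bufsz)
		return 0;

	/* Shift the tail first, then drop the new ident into the gap. */
	memmove(name_begin + new_len, gt + 1, tail_len);
	memcpy(name_begin, ident, (size_t)new_len);
	return 1;
}
```

A caller that only wants to know whether the buffer changed can test
the return value directly and skip any follow-up work when it is 0.
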
>>> +	}
>>> +
>>> +	strbuf_add(&name, ident.name_begin, ident.name_end - ident.name_begin);
>>> +	strbuf_add(&mail, ident.mail_begin, ident.mail_end - ident.mail_begin);
>>> +
>>> +	map_user(mailmap, &mail, &name);
>>> +
>>> +	strbuf_addf(&name, " <%s>", mail.buf);
>>> +
>>> +	strbuf_splice(buf, ident.name_begin - buf->buf,
>>> +		      ident.mail_end - ident.name_begin + 1,
>>> +		      name.buf, name.len);
>>
>> Would it give us better performance if we splice only when
>> map_user() tells us that we actually rewrote the ident?
>
> My intuition was that the cost of splice belongs to "memoving", when the
> size is different. Yet, Fixed, as it removes two copies.

Thanks.

I wonder if we can further restructure the code so that it first
inspects the existing buffer to see if it even needs to copy the
original commit buffer into a "strbuf only for grepping".  If that
can be easily done, then we will save even more copying, I think.

The reason I alluded to revamping the grep API to get rid of the use
of "header grep" mode in this codepath was exactly that.  We could:

 - change the command line parser for --author= and --committer= so
   that these do not become part of the main "grep" expression.
   Instead we keep them as separate grep expressions (one "author"
   expression that OR'es the --author= options together, the other
   for the --committer= options);

 - in this codepath, inspect the "author" and "committer" in the
   commit object buffer, map them if necessary via the mailmap
   mechanism into temporary buffers (that is different from the
   "buf" in the commit_match() function), then run grep_buffer()
   with the author and committer grep expressions we separated in
   the previous step.  Then we combine the results from "author" and
   "committer" grep and the main grep_buffer() result ourselves in
   this function.

That may essentially amount to going in the totally opposite
direction from what 2d10c55 (git log: Unify header_filter and
message_filter into one., 2006-09-20) attempted to do.
We used to have two grep expressions (one for header, the other one
for body) commit_match() runs grep_buffer() on and combined the
results.  2d10c55 merged them into one grep expression by
introducing a term that matches only header elements.  But we would
instead split the "header" expression into "author" and "committer"
expressions (making it three from one) if we go the above route.

That would eliminate the need to copy and rewrite the contents of
the commit object in this codepath, which may be a big win when
names and emails that need to be rewritten are minority cases.

But I suspect that is a much larger change.  If we can reduce the
amount of copies necessary without changing the code structure, that
may be enough to reduce the performance hit from this change.

Thanks.
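
For what it is worth, the three-expression route sketched earlier in
this message could look roughly like the toy below.  It is only an
illustration under loud assumptions: toy_filter, toy_match_any() and
toy_commit_match() are invented names, strstr() stands in for a real
grep expression, the author and committer strings are assumed to have
been mapped through the mailmap already, and AND-ing the three results
is my assumption about how they would be combined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter: three separate pattern lists, each NULL-terminated. */
struct toy_filter {
	const char **author_pats;	/* from --author=, OR'ed together */
	const char **committer_pats;	/* from --committer=, OR'ed together */
	const char **body_pats;		/* the main expression, OR'ed here */
};

/* Returns 1 if no pattern was given, or if any pattern occurs in text. */
static int toy_match_any(const char **pats, const char *text)
{
	size_t i;

	if (!pats || !pats[0])
		return 1;
	for (i = 0; pats[i]; i++)
		if (strstr(text, pats[i]))
			return 1;
	return 0;
}

/*
 * Match each part against its own expression and combine the results
 * ourselves.  The idents are small temporary strings, so the commit
 * buffer itself never has to be copied or rewritten.
 */
static int toy_commit_match(const struct toy_filter *f,
			    const char *mapped_author,
			    const char *mapped_committer,
			    const char *body)
{
	assert(f && mapped_author && mapped_committer && body);

	return toy_match_any(f->author_pats, mapped_author) &&
	       toy_match_any(f->committer_pats, mapped_committer) &&
	       toy_match_any(f->body_pats, body);
}
```

With two author patterns and one body pattern, a commit passes only
when one of the author patterns hits the mapped author and the body
pattern hits the message.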