Re: [PATCH/RFC 1/4] contrib: add git-contacts helper

Eric Sunshine <sunshine@xxxxxxxxxxxxxx> · Tue, 2 Jul 2013 04:17:14 -0400

On Mon, Jul 1, 2013 at 2:39 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Eric Sunshine <sunshine@xxxxxxxxxxxxxx> writes:
>
>> diff --git a/contrib/contacts/git-contacts b/contrib/contacts/git-contacts
>> new file mode 100755
>> index 0000000..9007bae
>> --- /dev/null
>> +++ b/contrib/contacts/git-contacts
>> @@ -0,0 +1,121 @@
>> +#!/usr/bin/perl
>> +
>> +# List people who might be interested in a patch.  Useful as the argument to
>> +# git-send-email --cc-cmd option, and in other situations.
>> +#
>> +# Usage: git contacts <file>
>> +
>> +use strict;
>> +use warnings;
>> +use IPC::Open2;
>> +
>> +my $since = '5-years-ago';
>> +my $min_percent = 10;
>> +my $labels_rx = qr/(?:Signed-off|Reviewed|Acked)-by/;
>
> Although I personally do not see particuarly a good reason to do so,
> I have seen people add "Cc:" on the footers, and I suspect they may
> expect them to be also picked up as "relevant parties" to the
> change.  Also S-o-b is often misspelled as "Signed-Off-By", so you
> may want to do qr//i this one.

I originally used /i but bogusly discarded it due to too narrow
thinking. Will re-add.

Agreed about adding Cc:.

>> +my $id_rx = qr/[0-9a-f]{40}/i;
>
> On the other hand, we always mark the "this is a format-patch
> output" marker lines with lowercase hex, so you would want to lose
> 'i' from here.

Indeed, all the SHA-1's matched by $id_rx in the script are
git-generated and thus lowercase hex. Will change.

> And you probably want to tighten it even more, perhaps
> like so:
>
>         qr/^From ([0-9a-f]{40}) Mon Sep 17 00:00:00 2001$/
>
> Of course, you wuold need to have a separate regular expression to
> parse "git blame --incremental/--porcelain" output that may read
> perhaps like so:
>
>         qr/^([0-9a-f]{40})[ *](\d) (\d) (\d)$/
>
> to pick up only the group header.  The last \d is the number of
> lines that came from this guilty party, and it might become useful
> if we want to give weight to people based on line-count, not just
> number of commits.

Can do.

>> +sub format_contact {
>> +     my ($name, $email) = @_;
>> +     return "$name <$email>";
>> +}
>> +
>> +sub parse_commit {
>> +     my ($commit, $data) = @_;
>> +     my $contacts = $commit->{contacts};
>> +     my $inbody = 0;
>> +     for (split(/^/m, $data)) {
>> +             if (not $inbody) {
>> +                     if (/^author ([^<>]+) <(\S+)> .+$/) {
>> +                             $contacts->{format_contact($1, $2)} = 1;
>
> The author name and email can be grabbed from the "blame" output
> without doing this (and the result may be more robust), but you
> would need to read from the log message anyway, so I think this is
> OK.
>
> Note that the names and emails in blame output are sanitized via the
> mailmap mechanism, but "cat-file commit" will certainly not be.

Thanks for pointing this out. Grabbing the author name and email from
git-blame output does seem like a better approach.

> You may have to do the mapping yourself by reading the mailmap for
> the names and addresses you read from S-o-b: and friends.

Felipe did introduce mailmap support in v4 but dropped it at about v6.
It seems worthwhile, but I first wanted to duplicate the basic
functionality of his v9, and figured that mailmap support could be
added in a follow-up patch.

v4: http://article.gmane.org/gmane.comp.version-control.git/222439/
v6: http://thread.gmane.org/gmane.comp.version-control.git/224896/

>> +sub import_commits {
>> +     my ($commits) = @_;
>> +     return unless %$commits;
>> +     my $pid = open2 my $reader, my $writer, qw(git cat-file --batch);
>
> Hmph.
>
> I vaguely recall that people wanted not to use open2/IPC::Open2 in
> other parts of the system.
>
> I think "cat-file --batch" is designed to behave on a regular bidi
> pipe, as long as the caller strictly does a ping-pong of issuing one
> request, flush (with an empty line) and always read the response, so
> if open2 becomes problematic, we could switch to regular pipes.

Checking the log, I see several cases where deprecated Python popen2
was removed but find nothing mentioning Perl. Git.pm,
git-archimport.perl and git-cvsimport.perl appear to use open2.
Switching to pipes is certainly an option.

>> +sub get_blame {
>> +     my ($commits, $source, $start, $len, $from) = @_;
>> +     $len = 1 unless defined($len);
>> +     return if $len == 0;
>> +     open my $f, '-|',
>> +             qw(git blame --incremental -C -C), '-L', "$start,+$len",
>> +             '--since', $since, "$from^", '--', $source or die;
>
> "--incremental" is meant for consumers that wants better latency,
> not necessarily better throughput, but I think this consumer does
> not present incremental result to the end user, so perhaps you would
> want to use "--porcelain" instead.

This use of --incremental was copied literally from Felipe's v9, but
after reading git-blame documentation, I agree that --porcelain makes
sense.

>> +sub scan_hunks {
>> +     my ($commits, $id, $f) = @_;
>> +     my $source;
>> +     while (<$f>) {
>> +             if (/^---\s+(\S+)/) {
>
> I wonder what happens to a patch that touches a file with SP in its
> path with this regular expression.  As it is fairly clear that you
> are reading from format-patch output (the caller does not call this
> function if it did not see $id), perhaps you can tighten the prefix
> a bit more?  I.e. something like:
>
>         if (/^--- (.*)$/)

Good idea.

>> +             } elsif (/^@@ -(\d+)(?:,(\d+))?/ && $source) {
>
> (mental note) For each hunk of a patch that is not a creation patch,
> find the origin of the preimage.
>
>> +                     get_blame($commits, $source, $1, $2, $id);
>> +             }
>
> An major issue (*) and a few minor issues (-) from the above
> observations:
>
>  * A single patch may touch two or more paths.  If the first one is
>    to modify an existing file, its path is assigned to $source.
>    Now, imagine that the second one is a creation patch.  $source is
>    not set to undef but is kept, and the code ends up trying to run
>    blame on the first path with the range for the second path.
>
>    Oops?
>
>    This is one of the reasons why we shouldn't use statement
>    modifiers lightly.  I think the above should be more like:
>
>         if (/^--- (.*)$) {
>                 $source = ($1 eq '/dev/null') ? undef : substr($1, 2);
>         } elsif ...

This error is entirely mine. Felipe's script did it correctly.

>  - If the patch were prepared with a non-standard src/dst-prefix,
>    unconditional substr($1, 2) would call blame on a wrong (and
>    likely to be nonexistent) path without a useful diagnosis (the
>    invocation of "git blame" will likely die with "no such path
>    'tailofpathname' in $id").
>
>    One way to make it more robust may be to do something like this:
>
>         if (/^--- /) {
>                 if (m{^--- (?:a/(.*)|/dev/null)$}) {
>                         $source = ($1 eq '/dev/null') ? undef : $1;
>                 } else {
>                         die "Cannot parse the patch file:$_";
>                 }
>         } elsif ...

The substr($1, 2) also bothered me, but I didn't immediately see a
good solution. Aborting, as you suggest, seems reasonable.

>  - Often a single patch touches more than one ranges in the same
>    path.  Depending on the size of the patch, it might be more
>    efficient to run a single blame for a range that covers all the
>    lines the patch touches while discarding irrelevant parts.  A
>    longer term improvement may be to extend "git blame" so that it
>    can take more than one "-L n,m" ranges, but I think that is
>    outside of the scope of this patch.

Sounds like a potentially good optimization for a future patch, though
it's not clear what heuristic to use to decide when to combine the
ranges for a single git-blame run. Extending git-blame to recognize
multiple -L sounds even better, though definitely outside the scope of
this series.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html