Eric Sunshine <sunshine@xxxxxxxxxxxxxx> writes: > diff --git a/contrib/contacts/git-contacts b/contrib/contacts/git-contacts > new file mode 100755 > index 0000000..9007bae > --- /dev/null > +++ b/contrib/contacts/git-contacts > @@ -0,0 +1,121 @@ > +#!/usr/bin/perl > + > +# List people who might be interested in a patch. Useful as the argument to > +# git-send-email --cc-cmd option, and in other situations. > +# > +# Usage: git contacts <file> > + > +use strict; > +use warnings; > +use IPC::Open2; > + > +my $since = '5-years-ago'; > +my $min_percent = 10; > +my $labels_rx = qr/(?:Signed-off|Reviewed|Acked)-by/; Although I personally do not see particuarly a good reason to do so, I have seen people add "Cc:" on the footers, and I suspect they may expect them to be also picked up as "relevant parties" to the change. Also S-o-b is often misspelled as "Signed-Off-By", so you may want to do qr//i this one. > +my $id_rx = qr/[0-9a-f]{40}/i; On the other hand, we always mark the "this is a format-patch output" marker lines with lowercase hex, so you would want to lose 'i' from here. And you probably want to tighten it even more, perhaps like so: qr/^From ([0-9a-f]{40}) Mon Sep 17 00:00:00 2001$/ Of course, you wuold need to have a separate regular expression to parse "git blame --incremental/--porcelain" output that may read perhaps like so: qr/^([0-9a-f]{40})[ *](\d) (\d) (\d)$/ to pick up only the group header. The last \d is the number of lines that came from this guilty party, and it might become useful if we want to give weight to people based on line-count, not just number of commits. > +sub format_contact { > + my ($name, $email) = @_; > + return "$name <$email>"; > +} > + > +sub parse_commit { > + my ($commit, $data) = @_; > + my $contacts = $commit->{contacts}; > + my $inbody = 0; > + for (split(/^/m, $data)) { > + if (not $inbody) { > + if (/^author ([^<>]+) <(\S+)> .+$/) { > + $contacts->{format_contact($1, $2)} = 1; The author name and email can be grabbed from the "blame" output without doing this (and the result may be more robust), but you would need to read from the log message anyway, so I think this is OK. Note that the names and emails in blame output are sanitized via the mailmap mechanism, but "cat-file commit" will certainly not be. You may have to do the mapping yourself by reading the mailmap for the names and addresses you read from S-o-b: and friends. > + } elsif (/^$/) { > + $inbody = 1; > + } > + } elsif (/^$labels_rx:\s+([^<>]+)\s+<(\S+?)>$/o) { > + $contacts->{format_contact($1, $2)} = 1; > + } > + } > +} > + > +sub import_commits { > + my ($commits) = @_; > + return unless %$commits; > + my $pid = open2 my $reader, my $writer, qw(git cat-file --batch); Hmph. I vaguely recall that people wanted not to use open2/IPC::Open2 in other parts of the system. I think "cat-file --batch" is designed to behave on a regular bidi pipe, as long as the caller strictly does a ping-pong of issuing one request, flush (with an empty line) and always read the response, so if open2 becomes problematic, we could switch to regular pipes. > + for my $id (keys(%$commits)) { > + print $writer "$id\n"; > + my $line = <$reader>; > + if ($line =~ /^($id_rx) commit (\d+)/o) { > + my ($cid, $len) = ($1, $2); > + die "expected $id but got $cid" unless $id eq $cid; > + my $data; > + # cat-file emits newline after data, so read len+1 > + read $reader, $data, $len + 1; > + parse_commit($commits->{$id}, $data); > + } > + } > + close $reader; > + close $writer; > + waitpid($pid, 0); > + die "git-cat-file error: $?" if $?; > +} > + > +sub get_blame { > + my ($commits, $source, $start, $len, $from) = @_; > + $len = 1 unless defined($len); > + return if $len == 0; > + open my $f, '-|', > + qw(git blame --incremental -C -C), '-L', "$start,+$len", > + '--since', $since, "$from^", '--', $source or die; "--incremental" is meant for consumers that wants better latency, not necessarily better throughput, but I think this consumer does not present incremental result to the end user, so perhaps you would want to use "--porcelain" instead. > + while (<$f>) { > + if (/^$id_rx/o) { > + my $id = $&; > + $commits->{$id} = { id => $id, contacts => {} }; > + } > + } > + close $f; > +} > + > +sub scan_hunks { > + my ($commits, $id, $f) = @_; > + my $source; > + while (<$f>) { > + if (/^---\s+(\S+)/) { I wonder what happens to a patch that touches a file with SP in its path with this regular expression. As it is fairly clear that you are reading from format-patch output (the caller does not call this function if it did not see $id), perhaps you can tighten the prefix a bit more? I.e. something like: if (/^--- (.*)$/) > + $source = substr($1, 2) unless $1 eq '/dev/null'; (mental note) A creation patch is special-cased, but this does not allow anything that does not begin with a single-byte --src-prefix (e.g. "a/path/to/patched/file"). > + } elsif (/^@@ -(\d+)(?:,(\d+))?/ && $source) { (mental note) For each hunk of a patch that is not a creation patch, find the origin of the preimage. > + get_blame($commits, $source, $1, $2, $id); > + } An major issue (*) and a few minor issues (-) from the above observations: * A single patch may touch two or more paths. If the first one is to modify an existing file, its path is assigned to $source. Now, imagine that the second one is a creation patch. $source is not set to undef but is kept, and the code ends up trying to run blame on the first path with the range for the second path. Oops? This is one of the reasons why we shouldn't use statement modifiers lightly. I think the above should be more like: if (/^--- (.*)$) { $source = ($1 eq '/dev/null') ? undef : substr($1, 2); } elsif ... - If the patch were prepared with a non-standard src/dst-prefix, unconditional substr($1, 2) would call blame on a wrong (and likely to be nonexistent) path without a useful diagnosis (the invocation of "git blame" will likely die with "no such path 'tailofpathname' in $id"). One way to make it more robust may be to do something like this: if (/^--- /) { if (m{^--- (?:a/(.*)|/dev/null)$}) { $source = ($1 eq '/dev/null') ? undef : $1; } else { die "Cannot parse the patch file:$_"; } } elsif ... - Often a single patch touches more than one ranges in the same path. Depending on the size of the patch, it might be more efficient to run a single blame for a range that covers all the lines the patch touches while discarding irrelevant parts. A longer term improvement may be to extend "git blame" so that it can take more than one "-L n,m" ranges, but I think that is outside of the scope of this patch. > + } > +} > + > +sub commits_from_patch { > + my ($commits, $file) = @_; > + open my $f, '<', $file or die "read failure: $file: $!"; > + my $id; > + while (<$f>) { > + if (/^From ($id_rx) /o) { > + $id = $1; > + last; > + } > + } > + scan_hunks($commits, $id, $f) if $id; > + close $f; > +} > + > +exit 1 unless @ARGV == 1; > + > +my %commits; > +commits_from_patch(\%commits, $ARGV[0]); > +import_commits(\%commits); > + > +my %count_per_person; > +for my $commit (values %commits) { > + for my $contact (keys %{$commit->{contacts}}) { > + $count_per_person{$contact}++; > + } > +} > + > +my $ncommits = scalar(keys %commits); > +for my $contact (keys %count_per_person) { > + my $percent = $count_per_person{$contact} * 100 / $ncommits; > + next if $percent < $min_percent; > + print "$contact\n"; > +} -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html