Hi Josh,

On Mon, 6 Feb 2017, Johannes Schindelin wrote:

> as discussed at the GitMerge, I am trying to come up with tooling that
> will allow for substantially less tedious navigation between the local
> repository, the mailing list, and what ends up in the `pu` branch.

I found a little bit more time last Friday to play with the
cross-correlation between commits in `pu` and mails in
public-inbox/git.git and it is worse than I previously assumed.

Just as a reminder: my plan was to start developing tools that will
ultimately help me as well as other contributors with the arcane mailing
list model of patch submission.

And my first target was the seemingly simple task of figuring out the
mail corresponding to any given commit in `pu` (i.e. the mail that
contained the patch, and whose mail thread is hence expected to have the
entire patch review, and to which I would be expected to respond if I
find a problem with that commit).

And since it is all-too-common that the oneline is adjusted before
applying the patch, the Subject:/oneline pair is not a good candidate to
find matches. My next best guess was that the author date would not be
touched, so the pair of Date: and authordate should make a good
candidate.

My initial finding was that this is not without problems, as some mails
were sent with identical Date: lines (most likely due to bugs in the
tools, e.g. the well-known and already fixed bug in git-am, and hence
git-rebase, where it would apply all patches using the first patch's
author date), and worse: some of those mails contained actual patch
series that actually made it into Git's commit history.

But those are not the only problems.

For starters, I tried to cross-correlate *just* the commits that entered
`pu` since one week ago (git rev-list --since=1.week.ago upstream/pu)
with mails of the past month in the mailing list archive.

One obvious caveat is that RFC 2822 is ambiguous when it comes to the
date format.
While it seems nice that you *can* write single-digit day numbers
as a single digit if you want, or with a leading zero, or with a leading
space, it makes it impossible to get away with exact matching. (I did
not really want to complicate my research by parsing the dates and
normalizing them to epoch + timezone, also because I wanted results
quickly, so I simply normalized the dates to have leading zeroes for
single-digit day numbers; that seems to work for the moment.)

The first category of problematic commits comes as no surprise: merges.
We do not even have a way to represent them as mails. I simply excluded
them from the remainder of this study.

The second category should not be all that surprising, either: Junio
often adjusts the release notes without sending those patches out for
review. Those commits are:

363588f (### match next, Junio C Hamano 2017-02-17)
2076907 (Git 2.12-rc2, Junio C Hamano 2017-02-17)
076c053 (Hopefully the final batch of mini-topics before the final,
	Junio C Hamano 2017-02-16)
ae86372 (Revert "reset: add an example of how to split a commit into
	two", Junio C Hamano 2017-02-16)
d09b692 (A bit more for -rc2, Junio C Hamano 2017-02-15)

There is a third category, and this one *does* come as a surprise to me.
It appears that at least *some* patches' Date: lines are either ignored
or overridden or changed on their way from the mailing list into Git's
commit history. There was only one such commit in that commit range:

3c0cb0c (read_loose_refs(): read refs using resolve_ref_recursively(),
	Michael Haggerty 2017-02-09)

This one was committed with an author date "Thu, 09 Feb 2017 21:53:52
+0100" but it appears that there was no mail sent to the Git mailing
list with that particular Date: header, and the *actual* mail containing
the patch was sent with a Date: header "Fri, 10 Feb 2017 12:16:19 +0100"
(Message-ID:
d8e906d969700acbca8dc717673d0a9cdc910f62.1486724698.git.mhagger@xxxxxxxxxxxx).
It is labor-intensive, but possible, to find the correlation manually in
this case because the Subject: line has been left intact.

However, this points to a serious problem with my approach: I try to
re-create information that is actually not available (which Message-ID
corresponds to a given commit name). Since that information is not
available, it is quite possible that this information cannot be
retrieved accurately (and Michael's commit demonstrates that this is not
a merely theoretical consideration).

I do not know that I can fix this on my side.

> P.S.: I used public-inbox.org links instead of commit references to the
> Git repository containing the mailing list archive, because the format
> of said Git repository is so unfavorable that it was determined very
> quickly in a discussion between Patrick Reynolds (GitHub) and myself
> that it would put totally undue burden on GitHub to mirror it there
> (compare also Carlos Nieto's talk at GitMerge titled "Top Ten Worst
> Repositories to host on GitHub").

Since the main problem was the unfavorable commit history structure, I
*think* that it may be possible to auto-process
public-inbox.org/git.git into a frequently-rewritten branch that
squashes all commits from past years into single, per-year commits (and
the same for recent months, the past days, and a single commit
accumulating the current day's commits) and that that may solve the
problematic structure. The blob names would remain identical to what is
on public-inbox, of course.

Ciao,
Johannes

P.S.: The *mini* scripts I used were

cat >generate-date-index.sh <<\EOF
#!/bin/sh

cd public-inbox-git || exit

since_commit="$1"
test -n "$since_commit" ||
since_commit=$(git rev-list --since=1.month.ago master --reverse |
	head -n 1)

# the fourth space-separated field of `git diff --raw` is the new blob
for sha1 in $(git diff --raw --no-abbrev $since_commit..master |
	cut -d ' ' -f 4)
do
	# print the Date: header with a zero-padded day, stop at end of headers
	printf '%s\t%s\n' \
		"$(git cat-file blob $sha1 | sed -n \
			-e 's/^Date:[ ]*\([^,]*,\) *\([1-9] .*\)/\1 0\2/p' \
			-e 's/^Date:[ ]*\([^,]*,\) *\([0-9][0-9] .*\)/\1 \2/p' \
			-e '/^$/q')" \
		$sha1
done | less -S
EOF

to generate a file date-index.txt containing "date\tblob" pairs where
the blob refers to the SHA-1 of the mail in public-inbox/git.git, and

cat >match-pu.sh <<\EOF
#!/bin/sh

for commit in $(git rev-list --since=1.week.ago --no-merges upstream/pu)
do
	date="$(git show -s --format=%aD $commit |
		sed 's/, \([1-9]\) /, 0\1 /')" # fix up Git's idea of RFC 2822
	# the blob is the second (tab-separated) column; the unquoted $(...)
	# joins multiple matches with single spaces
	mail_id=$(echo $(grep "^$date" date-index.txt | cut -f 2))
	case "$mail_id" in
	'')
		echo "ERROR: no mail found for $commit (date $date)" >&2
		git show -s --pretty='tformat:%h (%s, %an %ad)' --date=short \
			$commit >&2
		;;
	*' '*)
		echo "ERROR: multiple candidates found for $commit ($mail_id)" >&2
		;;
	*)
		echo "$date $mail_id"
		;;
	esac
done
EOF

to try to match the author dates with the ones in date-index.txt. The
obvious next improvement is to list also Message-ID in date-index.txt.
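
A minimal sketch of that improvement, under stated assumptions: the
helper name `mail_index_line` is hypothetical (it is not part of the
scripts above), it assumes the Message-ID is enclosed in angle brackets
as usual, and it reuses the same leading-zero normalization of the Date:
header. It reads one mail on stdin and prints "date\tmessage-id", looking
only at the header block:

```shell
# Hypothetical helper, not part of the original scripts: read one mail on
# stdin and print "<normalized date><TAB><message-id>" for it.
mail_index_line () {
	# Keep only the headers, i.e. everything up to the first empty line,
	# so that Date:/Message-ID: lines quoted in the body are ignored.
	headers=$(sed '/^$/q')

	# Same normalization as above: zero-pad single-digit day numbers.
	date=$(printf '%s\n' "$headers" | sed -n \
		-e 's/^Date:[[:space:]]*\([^,]*,\) *\([1-9] .*\)/\1 0\2/p' \
		-e 's/^Date:[[:space:]]*\([^,]*,\) *\([0-9][0-9] .*\)/\1 \2/p' |
		head -n 1)

	# Strip the angle brackets around the Message-ID (assumed present);
	# both the "Message-ID" and "Message-Id" spellings are accepted.
	msgid=$(printf '%s\n' "$headers" | sed -n \
		-e 's/^Message-I[Dd]:[[:space:]]*<\(.*\)>.*/\1/p' |
		head -n 1)

	printf '%s\t%s\n' "$date" "$msgid"
}
```

In generate-date-index.sh, something like
`printf '%s\t%s\n' "$(git cat-file blob $sha1 | mail_index_line)" $sha1`
would then produce "date\tmessage-id\tblob" lines. Keeping the blob as
the last column is only one possible layout; the lookup in match-pu.sh
would have to be adjusted to whichever column order is chosen.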