Cross-referencing the Git mailing list archive with their corresponding commits in `pu`

Hi Josh,

as discussed at the GitMerge, I am trying to come up with tooling that
will allow for substantially less tedious navigation between the local
repository, the mailing list, and what ends up in the `pu` branch.

That tooling would *still* not help lowering the barrier of entry for
contributing to Git by a lot, as it would *still* not address the problem
that mails sent from the most prevalent desktop mail client, as well as
mails sent from the most prevalent web mail client, are simply and
unceremoniously dropped. (This problem was acknowledged by quite a few
nods even at the Contributors' Summit...) But still, we decided to start
*somewhere* and this tooling is what we agreed on.

It is quite a bit harder going than I would like: as we have figured out,
the Subject: line is not a good way to link the commits with the original
mails containing the patches, as commit messages are modified before being
pushed often enough to make this a fragile matching.
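
To illustrate why Subject: matching is fragile, here is a minimal sketch
in Python. The helper names (`normalized_subject`, `subjects_match`) and
the sample subjects are made up for illustration; this is not the tooling
itself, only the naive comparison it would have to improve upon.

```python
import re

# Leading bracketed tags such as "[PATCH v2 1/4]" exist only in the mail.
_BRACKET_PREFIX = re.compile(r"^(?:\s*\[[^\]]*\]\s*)+")


def normalized_subject(subject: str) -> str:
    """Strip leading [..] tags, collapse whitespace and lower-case."""
    stripped = _BRACKET_PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def subjects_match(mail_subject: str, commit_subject: str) -> bool:
    """Naive matching: equal after normalization."""
    return normalized_subject(mail_subject) == normalized_subject(commit_subject)
```

This catches the easy cases, but as soon as the commit message's first
line is reworded before being pushed, `subjects_match()` returns False
even though mail and commit belong together.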

So I thought maybe the From: line (from the body, if available, otherwise
from the header) in conjunction with the "Date:" header would work. But a
preliminary study shows that there are 336 From: + Date: combinations in
the Git mailing list archive that are not unique. 71 of these are shared
by three or more mails, and 9 are even shared by more than 10 mails.
This is bad!
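
A rough sketch of how such a count can be obtained, in Python with only
the standard library. The function names are hypothetical, the mails are
assumed to be available as raw strings, and the Date: header is compared
verbatim (no timezone normalization), which is an assumption on my part:

```python
from collections import Counter
from email import message_from_string


def from_date_key(raw_mail: str) -> tuple[str, str]:
    """Return the (From, Date) pair used as a candidate matching key."""
    msg = message_from_string(raw_mail)
    author = (msg.get("From") or "").strip()
    body = msg.get_payload()
    if isinstance(body, str):  # multipart mails yield a list; skip those
        # An in-body "From:" on the first non-empty line wins over the header.
        for line in body.splitlines():
            if not line.strip():
                continue
            if line.startswith("From: "):
                author = line[len("From: "):].strip()
            break
    return author, (msg.get("Date") or "").strip()


def ambiguous_keys(raw_mails):
    """Map each (From, Date) pair shared by several mails to its count."""
    counts = Counter(from_date_key(mail) for mail in raw_mails)
    return {key: n for key, n in counts.items() if n > 1}
```

Every key returned by `ambiguous_keys()` is a From: + Date: combination
that cannot identify a single mail on its own.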

Unsurprisingly, the top 10 of these cases were obviously caused by the
builtin `git am` bug where it would not reset the author date properly.
Surprisingly, though, there were a few cases from 2005, too.

I had a quick look to find out what the culprit was (looking at the
17-strong patch series "Documentation fixes in response to my previous
listing" by Nikolai Weibull), but I am at a loss there: the mail claims to
be sent by git-send-email and the patches appear to be generated by
git-format-patch as of v0.99.9l, neither of which had a Date:-related bug
back in that time frame. My best guess is that the patches were mishandled
by a tool similar to rebase -i (which entered Git only at v1.5.3).

For details, see:
http://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@xxxxxxxxxxxxxxxxxxxxxx/
(this is also an example where public-inbox's thread detection went utterly
wrong, including way too many mails in the "thread")

There was even a case of duplicated Date: headers in 2012. Now, this case
is very curious, as there have been 7 mails with an identical Date: header,
but it was not a 7-strong patch series. Instead, it was a 4-strong patch
series that needed three iterations before it was accepted, and the
identical Date: header appears only in v2's patches (*not* in its cover
letter) and it *disappeared* in v3's 4/4, where it was set *back* by a
week (to the Date: it had in v1).

For details, see
http://public-inbox.org/git/cover.1354693001.git.Sebastian.Leske@xxxxxxxxxxx/
and
http://public-inbox.org/git/cover.1354324110.git.Sebastian.Leske@xxxxxxxxxxx/
and
http://public-inbox.org/git/b115a546fa783b4121d118bb8fdb9270443f90fa.1353691892.git.Sebastian.Leske@xxxxxxxxxxx/

This last example also demonstrates a very curious test case for a
different difficulty in trying to reconstruct lost correspondences: the
patch series was applied *twice*, independently of each other. First, on
the day v3 was submitted, it was applied on top of v1.8.1-rc0 (as commits
ee26a6e2b8..dd465ce66f), although it was not merged until v1.8.1-rc3. 22
days later, it was reapplied on top of maint so it could enter v1.8.0.3
(back then, Git still had "patchlevel" versions): c2999adcd5..008c208c2c.

As you can see, there is a many-to-many relationship here, even if you do
leave the *original* branch out of the picture entirely.
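
For what it is worth, a minimal sketch of a data structure that can hold
such a many-to-many relationship, in Python. The class name, the method
names and the identifiers in the usage example are invented for
illustration only:

```python
from collections import defaultdict


class Correspondences:
    """Many-to-many links between mails (by Message-ID) and commits."""

    def __init__(self):
        self._commits_by_mail = defaultdict(set)
        self._mails_by_commit = defaultdict(set)

    def link(self, message_id: str, commit: str) -> None:
        """Record that the given mail ended up as the given commit."""
        self._commits_by_mail[message_id].add(commit)
        self._mails_by_commit[commit].add(message_id)

    def commits_for(self, message_id: str) -> list[str]:
        """All commits a mail was turned into (possibly several)."""
        return sorted(self._commits_by_mail.get(message_id, ()))

    def mails_for(self, commit: str) -> list[str]:
        """All mails that may be the origin of a commit."""
        return sorted(self._mails_by_commit.get(commit, ()))
```

A patch that was applied twice would be recorded with two `link()` calls
for the same Message-ID, e.g. `link("<v3-1-of-4@example>", "aaaa")` and
`link("<v3-1-of-4@example>", "bbbb")` (placeholder ids, not real ones).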

Will keep you posted,
Dscho

P.S.: I used public-inbox.org links instead of commit references to the
Git repository containing the mailing list archive, because the format of
said Git repository is so unfavorable that it was determined very quickly
in a discussion between Patrick Reynolds (GitHub) and myself that it would
put totally undue burden on GitHub to mirror it there (compare also Carlos
Nieto's talk at GitMerge titled "Top Ten Worst Repositories to host on
GitHub").


