Re: Feature request: provide a persistent IDs on a commit

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jul 19 2022, Stephen Finucane wrote:

> On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Mon, Jul 18 2022, Stephen Finucane wrote:
>> 
>> > ...to track evolution of a patch through time.
>> > 
>> > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
>> > ID' trailer used by Gerrit into git core?
>> > 
>> > Firstly, apologies in advance if this is the wrong forum to post a feature
>> > request. I help maintain the Patchwork project [1], which a web-based tool that
>> > provides a mechanism to track the state of patches submitted to a mailing list
>> > and make sure stuff doesn't slip through the crack. One of our long-term goals
>> > has been to track the evolution of an individual patch through multiple
>> > revisions. This is surprisingly hard goal because oftentimes there isn't a whole
>> > lot to work with. One can try to guess whether things are the same by inspecting
>> > the metadata of the commit (subject, author, commit message, and the diff
>> > itself) but each of these metadata items are subject to arbitrary changes and
>> > are therefore fallible.
>> > 
>> > One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
>> > used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
>> > git commit hook that you can install locally. When installed, this appends a
>> > 'Change-ID' trailer to each and every commit message. In this way, the evolution
>> > of a patch (or a "change", in Gerrit parlance) can be tracked through time since
>> > the Change ID provides an authoritative answer to the question "is this still
>> > the same patch". Unfortunately, there are still some obvious downside to this
>> > approach. Not only does this additional trailer clutter your commit messages but
>> > it's also something the user must install themselves. While Gerrit can insist
>> > that this is installed before pushing a change, this isn't an option for any of
>> > the common forges nor is it something git-send-email supports.
>> 
>> git format-patch+send-email will send your trailers along as-is, how
>> doesn't it support Change-Id. Does it need some support that any other
>> made-up trailer doesn't?
>
> It supports sending the trailers, sure. What it doesn't support is insisting you
> send this specific trailer (Change-Id). Only Gerrit can do this (server side,
> thankfully, which means you don't need to ask all contributors to install this
> hook if you want to rely on it for tooling, CI, etc.).

Ah, it's still unclear to me what you're proposing here though. That
send-email always (generates?) or otherwise insists on the trailer, that
it can be configured ot add it?

That send-email have some "pre-send-email" hook? Something else?

I'd think for projects that care about this they're likely to have a
centralized enough workflow that it can be checked on the remote side,
whether that's some sanity check on the applier's "git am" pipeline, or
a "pre-receive" hook.

>> > I imagine most people working with mailing list based workflows have their own
>> > client side tooling to support this while software forges like GitHub and GitLab
>> > simply don't bother tracking version history between individual commits in a
>> > pull/merge request.
>> 
>> It's far from ideal, but at least GitLab shows a diff on a push to a MR,
>> including if it's force-pushed. I'm not sure about GitHub.
>
> GitHub does not. Simply piling multiple additional "fix" commits onto the PR
> branch results in a less horrible review experience since you can maintain
> context, alas at the cost of a rotten git log. We don't need to debate the pros
> and cons of the various forges though :)

Yes, I'm only mentioning it because it's worth looking at existing
"solutions" that are in use in the wild, however flawed those may be.

>> > IMO though, it would be fantastic if third party tools
>> > weren't necessary though. What I suspect we want is a persistent ID (or rather
>> > UUID) that never changes regardless of how many times a patch is cherry-picked,
>> > rebased, or otherwise modified, similar to the Author and AuthorDate fields.
>> > Like Author and AuthorDate, it would be part of the core git commit metadata
>> > rather than something in the commit message like Signed-Off-By or Change-ID.
>> > 
>> > Has such an idea ever been explored? Is it even possible? Would it be broadly
>> > useful?
>> 
>> This has come up a bunch of times. I think that the thing git itself
>> should be doing is to lean into the same notion that we use for tracking
>> renames. I.e. we don't, we analyze history after-the-fact and spot the
>> renames for you.
>
> Any idea where I'd find previous discussions on this? I did look, and the only
> proposal I found was an old one that seemed to suggest including the Change-Id
> commit-msg hook with git itself which is not what I'm suggesting here.

At the time I was punting on finding the links, and just working off
vague recollection, and hoping you'd go list spelunking.

But I since recalled some details, I think the most relevant thing is
this discussion about a "git evolve":

    https://lore.kernel.org/git/CAPL8ZivFmHqS2y+WmNR6faRMnuahiqwPVYsV99NiJ1QLHOs9fQ@xxxxxxxxxxxxxx/

Which I think you'll find useful, especially as mercurial has an
existing implementation. The wider context for that "git evolve" is (I
believe) people at Google who maintain Gerrit trying to "upstream" the
Change-Id.

Now, it hasn't landed in git.git, and it's been a few years, but going
through the details of why it fizzled out will be useful to you, if
you're interested in driving something like this forward.

There's also these two proposals from Eric Raymond:

	https://lore.kernel.org/git/20190515191605.21D394703049@xxxxxxxxxxxxxxxxx/
	https://lore.kernel.org/git/20190521013250.3506B470485F@xxxxxxxxxxxxxxxxx/

Which I'm linking to here not because I think they're viable, as you can
see from my participation in those threads I think what he suggested is
an architectural dead end as far as git is concerned.

But rather because it's conceptually adjacent (you could in principle
use nanosecond timestamps as a poor man's UUID), and much of the
follow-up discussion is about format changes in general, and if/when
those might be viable.

>> We have some of that in git already, as git-patch-id, and more recently
>> git-range-diff. Both are flawed in a bunch of ways, and it's easy to run
>> into edge cases where they don't spot something that they "should"
>> have. Where "should" exists in the mind of the user.
>
> That's a fair point and is of course what we (Patchwork) have to do currently.
> Patchwork can track relations between individual patches but doesn't attempt to
> generate these relations itself. Instead, we rely on third-party tooling. The
> PaStA tool was one such example of a tool that could do this [1]. I can't
> imagine a tool like Gerrit would ever work without this concept of an
> authoritative (and arbitrary) identifier to track a patch's identity through
> time, hence its reliance on the Change-Id trailer.

I haven't used Gerrit or Patchwork, so much of this is from ignorance on
that front, but I have spent a lot of time thinking about this in the
context of git in general.

I think as users of git go the git project itself makes very heavy use
of this, i.e. sequences of patches are substantially rewritten, split,
squashed etc. all the time, or even split into two or more sets of
submissions.

Having said all that I can't see how a Change-Id isn't a Bad Idea(TM)
for all the same reasons that pre-git SCMs file formats that track
renames explicitly were a bad idea.

I.e. yes you can come up with cases where that's "better" than what git
does, but they didn't handle splitting/merging files etc.

Similarly what happens when you have 3 patches each with their own
Change-Id and you split them into 4 patches. Is the Change-Id 1=1 or
1=many. I'm suggesting that you'd want a solution that can be many=many.

And also, that those many=many should be dynamically configurable and
inferred after the fact. E.g. range-diff will commits that are similar
enough that two authors with no knowledge of each other independently
came up with.

I think that range-diff is still lacking in a lot of ways, in particular:

 * It matches entire commits (log + diff) on a similarity score, I've
   often wanted a way to "weigh" it, so e.g. a matching hunk would have
   3x the matching score of a matching commit message.

   Now it often "gives up", you can give it a higher --creation-factor,
   but that's "global", so for a large range you'll often start
   including irrelevant things as well.

 * It only does 1=1 attribution, and e.g. currently can't find/represent
   a case where a commit with 3 hunks got split into two commits, with 2
   and 1 hunks, respectively. It'll (usually) show a diff to the new 2
   hunk commit, but the "new" 1 hunk will be shown as new.

   We could continue to drill down and find such "unattributed" hunks.

> Perhaps we could flip this on its head. What would be the _downsides_ of
> providing a persistent, arbitrary identifier on a commit similar to Author and
> AuthorDate fields? There's obviously some work involved in implementing it but
> assuming that was already done, what would break/be worse as a result?

That "Repository formats matter", to borrow a phrase from a classic post
about git[1]. Once you provide a way to do something it will be used,
and when that something has inherent limitations (think SCM rename
tracking) used to the exclusion of others.

You can't provide something like that as an opt-in and "upstream" it
without it inevetably trickling into a lot of areas of Git's UX.

To continue the rename example, now you can just re-arrange your source
tree and not worry about micro-managing it with "git mv" (in the "svn
mv" sense), git will figure it out after the fact.

That's a sinificant UX benefit, we can provide a *much simpler* UX as a
result.

What would be the harm of an optional "rename tracking" header? After
all the heuristic sometimes "fails".

The harm would be that if you really wanted to lean into that (even
optionally) you'd be forced to add that to all sorts of tooling, not
just the cheap convenience that is "git mv" currently.

Likewise everything from "cherry-pick" to "rebase" to "commit" would
inevitably have to learn some way to know about, carry forward and ask
the user about Change-Id's and their preservation. Don't you think so?

Otherwise they'd be much too easy to lose track of, and if they only
reason we did all that is because we didn't think enough about the "work
it out after" approach that would be a bad investment of time.

But I may be wrong about all of that, I think one thing that would
really help clarify this & similar proposals is if people pushing it
forward came up with some basic tests for it, i.e. just something like
a:

    series-v1/
    series-v2/

Where those two directories would be the "git format-patch" output (or
whatever) of two versions of a series that Gerrit or Patchwork are now
managing, along with some (plain text?) manual mapping of which things
in v1 correspond to v2.

We could then compare how that manual attribution performs v.s. trying
to find which things match (range-diff) afterwards.

1. https://keithp.com/blog/Repository_Formats_Matter/





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux