Re: [PATCH 09/10] fast-export: add a --show-original-ids option to show original names

Elijah Newren <newren@xxxxxxxxx> · Mon, 12 Nov 2018 07:46:14 -0800

On Mon, Nov 12, 2018 at 4:53 AM Jeff King <peff@xxxxxxxx> wrote:
> On Sun, Nov 11, 2018 at 12:32:22AM -0800, Elijah Newren wrote:
>
> > > >  Documentation/git-fast-export.txt |  7 +++++++
> > > >  builtin/fast-export.c             | 20 +++++++++++++++-----
> > > >  fast-import.c                     | 17 +++++++++++++++++
> > > >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> > > >  4 files changed, 56 insertions(+), 5 deletions(-)
> > >
> > > The fast-import format is documented in Documentation/git-fast-import.txt.
> > > It might need an update to cover the new format.
> >
> > We document the format in both fast-import.c and
> > Documentation/git-fast-import.txt?  Maybe we should delete the long
> > comments in fast-import.c so this isn't duplicated?
>
> Yes, that is probably worth doing (see the comment at the top of
> fast-import.c). Some information might need to be migrated.
>
> If we're going to have just one spot, I think it needs to be the
> user-facing documentation. This is a public interface that other people
> are building compatible implementations for (including your new tool).

Okay, I'll work on that.

> OK, that matches my understanding. So why does fast-export need to print
> the blob ids? If the intermediary is rewriting blobs, it can then
> produce the "originally" line itself, can't it?
>
> The more interesting case I guess is your "strip out blobs by id"
> example. There the intermediary _could_ do so itself, but it would
> require recomputing the object id of each blob.
>
> If you use "--no-data", then this just works (we specify tree entries by
> object id, rather than by mark). But I can see how it would be useful to
> have the information even without "--no-data" (i.e., if you are doing
> multiple kinds of rewrites on a single stream).
>
> I think the thing that confused me is that this "originally" is doing
> two things:
>
>   - mentioning blob ids as an optimization / convenience for the reader
>
>   - mentioning rewritten commit (and presumably tag?) ids that were
>     rewritten as part of a partial history export. I suppose even trees
>     could be rewritten that way, too, but fast-import doesn't generally
>     consider trees to be a first-class item.
>
> So I'm OK with it, but I wonder if there is an easier way to explain it.

Yeah, I started out just needing to add the original oids for commits.
Once I added them there, I wondered whether someone would need them
for tags and blobs too (not trees since fast-import doesn't work with
those).  For blobs, it made sense as a small performance optimization
(when running without --no-data), as you pointed out.  I can't think
of a use for them in tags, but once I've included them in blobs and
commits it felt like I might as well include them there for
completeness.  So maybe my commit message should have been something
more like:

"""
Knowing the original names (hashes) of commits can sometimes enable
post-filtering that would otherwise be difficult or impossible.  In
particular, the desire to rewrite commit messages which refer to other
prior commits (on top of whatever other filtering is being done) is
very difficult without knowing the original names of each commit.

In addition, knowing the original names (hashes) of blobs can allow
filtering by blob-id without requiring re-hashing the content of the
blob, and is thus useful as a small optimization.

Once we add original ids for both commits and blobs, we may as well
add them for tags too for completeness.  Perhaps someone will have a
use for them.

This commit teaches a new --show-original-ids option to fast-export
which will make it add a 'original-oid <hash>' line to blob, commits,
and tags.  It also teaches fast-import to parse (and ignore) such
lines.
"""

?