Re: [PATCH 09/10] fast-export: add a --show-original-ids option to show original names

Jeff King <peff@xxxxxxxx> · Mon, 12 Nov 2018 07:53:41 -0500

On Sun, Nov 11, 2018 at 12:32:22AM -0800, Elijah Newren wrote:

> > >  Documentation/git-fast-export.txt |  7 +++++++
> > >  builtin/fast-export.c             | 20 +++++++++++++++-----
> > >  fast-import.c                     | 17 +++++++++++++++++
> > >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> > >  4 files changed, 56 insertions(+), 5 deletions(-)
> >
> > The fast-import format is documented in Documentation/git-fast-import.txt.
> > It might need an update to cover the new format.
> 
> We document the format in both fast-import.c and
> Documentation/git-fast-import.txt?  Maybe we should delete the long
> comments in fast-import.c so this isn't duplicated?

Yes, that is probably worth doing (see the comment at the top of
fast-import.c). Some information might need to be migrated.

If we're going to have just one spot, I think it needs to be the
user-facing documentation. This is a public interface that other people
are building compatible implementations for (including your new tool).

> > > +--show-original-ids::
> > > +     Add an extra directive to the output for commits and blobs,
> > > +     `originally <SHA1SUM>`.  While such directives will likely be
> > > +     ignored by importers such as git-fast-import, it may be useful
> > > +     for intermediary filters (e.g. for rewriting commit messages
> > > +     which refer to older commits, or for stripping blobs by id).
> >
> > I'm not quite sure how a blob ends up being rewritten by fast-export (I
> > get that commits may change due to dropping parents).
> 
> It doesn't get rewritten by fast-export; it gets rewritten by other
> intermediary filters, e.g. in something like this:
> 
>    git fast-export --show-original-ids --all | intermediary_filter |
> git fast-import
> 
> The intermediary_filter program may want to strip out blobs by id, or
> remove filemodify and filedelete directives unless they touch certain
> paths, etc.

OK, that matches my understanding. So why does fast-export need to print
the blob ids? If the intermediary is rewriting blobs, it can then
produce the "originally" line itself, can't it?

The more interesting case I guess is your "strip out blobs by id"
example. There the intermediary _could_ do so itself, but it would
require recomputing the object id of each blob.

If you use "--no-data", then this just works (we specify tree entries by
object id, rather than by mark). But I can see how it would be useful to
have the information even without "--no-data" (i.e., if you are doing
multiple kinds of rewrites on a single stream).

I think the thing that confused me is that this "originally" is doing
two things:

  - mentioning blob ids as an optimization / convenience for the reader

  - mentioning rewritten commit (and presumably tag?) ids that were
    rewritten as part of a partial history export. I suppose even trees
    could be rewritten that way, too, but fast-import doesn't generally
    consider trees to be a first-class item.

So I'm OK with it, but I wonder if there is an easier way to explain it.

-Peff