Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename

Elijah Newren <newren@xxxxxxxxx> · Tue, 13 Nov 2018 09:10:36 -0800

On Tue, Nov 13, 2018 at 6:45 AM Jeff King <peff@xxxxxxxx> wrote:
> It is an expensive log command, but it's the same expense as running
> fast-export, no? And I think maybe that is the disconnect.

I would expect an expensive log command to generally be the same
expense as running fast-export, yes.  But I would expect two expensive
log commands to be twice the expense of a single fast-export (and you
suggested two log commands: both the --find-object= one and the
--diff-filter one).

> I am looking at this problem as "how do you answer question X in a
> repository". And I think you are looking at as "I am receiving a
> fast-export stream, and I need to answer question X on the fly".
>
> And that would explain why you want to get extra annotations into the
> fast-export stream. Is that right?

I'm not trying to get information on the fly during a rewrite or
anything like that.  This is an optional pre-rewrite step (from a
separate invocation of the tool) where I have multiple questions I
want to answer.  I'd like to answer them all relatively quickly, if
possible, and I think all of them should be answerable with a single
history traversal (plus a cat-file --batch-all-objects call to get
object sizes, since I don't know of another way to get those).  I'd be
fine with switching from fast-export to log or something else if it
met the needs better.

As far as I can tell, you're trying to split each question apart and
do a history traversal for each, and I don't see why that's better.
Simpler, perhaps, but it seems worse for performance.  Am I missing
something?

> > > There I think you'd want to assemble the list with something like "git
> > > log --follow --name-only paths-of-interest" except that --follow sucks
> > > too much to handle more than one path at a time.
> > >
> > > But if you wanted to do it manually, then:
> > >
> > >   git log --diff-filter=R --name-only
> > >
> > > would be enough to let you track it down, wouldn't it?
> >
> > Without a -M you'd only catch 100% renames, right?  Those aren't the
> > only ones I'd want to catch, so I'd need to add -M.  You are right
> > that we could get basic renames this way, but it doesn't cover
> > everything I need.  Let's use this as a starting point, though, and
> > build up to what I need...
>
> No, renames are on by default these days, and that includes inexact
> renames. That said, if you're scripting you probably ought to be doing:
>
>   git rev-list HEAD | git diff-tree --stdin
>
> and there yes, you'd have to enable "-M" yourself (you touched on
> scripting and formatting below; diff-tree can accept the format options
> you'd want).

Ah, I didn't know renames were on by default; I somehow missed that.
Also, the rev-list to diff-tree pipe is nice, but I also need parent
and commit timestamp information.

....
> Yeah, I think "-t" would help your tree deletion problem.

Absolutely, thanks for the hint.  Much appreciated.  :-)

> > At this point, let's remember that we had another full git-log
> > invocation for mapping object sizes to filenames.  We might as well
> > coalesce the two log commands into one, by extending this latest one
> > to:
> >
> >   git log -M --diff-filter=RAMD --no-abbrev --raw
>
> What is there besides RAMD? :)

Well, as you pointed out above, log detects renames by default,
whereas it didn't used to.
So, if someone had written some similar-ish history walking/parsing
tool years ago that didn't depend need renames and was based on log
output, there's a good chance their tool might start failing when
rename detection was turned on by default, because instead of getting
both a 'D' and an 'M' change, they'd get an unexpected 'R'.

For my case, do I have to worry about similar future changes?  Will
copy detection ('C') or break detection ('B') become the default in
the future?  Do I have to worry about typechanges ('T")?  Will new
change types be added?  I mean, the fast-export output could maybe
change too, but it seems much less likely than with log.

> > I could potentially switch to using this and drop patch 10/10.
>
> So I'm still not _entirely_ clear on what you're trying to do with
> 10/10. I think maybe the "disconnect" part I wrote above explains it. If
> that's correct, then I think framing it in terms of the operations that
> you'd be able to perform _without running a separate traverse_ would
> make it more obvious.

Let me try to put it as briefly as I can.  With as few traversals as
possible, I want to:
  * Get all blob sizes
  * Map blob shas to filename(s) they appeared under in the history
  * Find when files and directories were deleted (and whether they
were later reinstated, since that means they aren't actually gone)
  * Find sets of filenames referring to the same logical 'file'. (e.g.
foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz}
refer to the same 'file' so that a user has an easy report to look at
to find out that if they just want to "keep baz and its history" then
they need foo & bar & baz.  I need to know about things like another
foo or bar being introduced after the rename though, since that breaks
the connection between filenames)
  * Do a few aggregations on the above data as well (e.g. all copies
of postgres.exe add up to 20M -- why were those checked in anyway?,
*.webm files in aggregate are .5G, your long-deleted src/video-server/
directory from that aborted experimental project years ago takes up 2G
of your history, etc.)

Right now, my best solution for this combination of questions is
'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10
in place.  I'm totally open to better solutions, including ones that
don't use fast-export.