Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename

Jeff King <peff@xxxxxxxx> · Mon, 12 Nov 2018 07:58:48 -0500

On Sun, Nov 11, 2018 at 12:42:58AM -0800, Elijah Newren wrote:

> > > fast-export output is traditionally used as an input to a fast-import
> > > program, but it is also useful to help gather statistics about the
> > > history of a repository (particularly when --no-data is also passed).
> > > For example, two of the types of information we may want to collect
> > > could include:
> > >   1) general information about renames that have occurred
> > >   2) what the biggest objects in a repository are and what names
> > >      they appear under.
> > >
> > > The first bit of information can be gathered by just passing -M to
> > > fast-export.  The second piece of information can partially be gotten
> > > from running
> > >     git cat-file --batch-check --batch-all-objects
> > > However, that only shows what the biggest objects in the repository are
> > > and their sizes, not what names those objects appear as or what commits
> > > they were introduced in.  We can get that information from fast-export,
> > > but when we only see
> > >     R oldname newname
> > > instead of
> > >     R oldname newname
> > >     M 100644 $SHA1 newname
> > > then it makes the job more difficult.  Add an option which allows us to
> > > force the latter output even when commits have exact renames of files.
> >
> > fast-export seems like a funny tool to look up paths. What about "git
> > log --find-object=$SHA1" ?
> 
> Eek, and give me O(N*M) behavior, where N is the number of commits in
> the repository and M is the number of renames that occur in its
> history?  Also, that's the inverse of the lookup I need anyway (I have
> the commit and filename, but am missing the SHA).

Maybe I don't understand what you're trying to accomplish. I was
thinking specifically of your "cat-file can tell you the large objects,
but you don't know their names/commits" from above.

I would do:

   git log --raw $(
     git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
     sort -rn | head -3 |
     awk '{print "--find-object=" $2 }'
   )

I'm not sure how renames enter into it at all.

> One of the problems with filter-branch that people often run into is
> they know what they want at a high-level (e.g. extract the history of
> this directory for a new repository, or rewrite the history of this
> repo to appear at a subdirectory so it can be merged into a bigger
> repo and people passing filenames to log will still get the history of
> those files, or I want to remove some of the big stuff in my history),
> but often times that's not quite enough.  They need help finding big
> objects, or may be unaware that the subset of files they want used to
> be known by alternative names.
> 
> I want a simple --analyze mode that can report on all files that have
> been renamed (so users don't just say "all I care about is these N
> files, give me a rewritten history just including those" -- we can
> point out to them whether those N files used to be known by other
> names), as well as reporting on all big files and if they've been
> deleted, and aggregations of the "big files" information across
> directories and file extensions.

So this seems like a separate problem than what the commit message talks
about.

There I think you'd want to assemble the list with something like "git
log --follow --name-only paths-of-interest" except that --follow sucks
too much to handle more than one path at a time.

But if you wanted to do it manually, then:

  git log --diff-filter=R --name-only

would be enough to let you track it down, wouldn't it?

-Peff