Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename

Elijah Newren <newren@xxxxxxxxx> · Sun, 11 Nov 2018 00:42:58 -0800

On Sat, Nov 10, 2018 at 11:23 PM Jeff King <peff@xxxxxxxx> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote:
>
> > fast-export output is traditionally used as an input to a fast-import
> > program, but it is also useful to help gather statistics about the
> > history of a repository (particularly when --no-data is also passed).
> > For example, two of the types of information we may want to collect
> > could include:
> >   1) general information about renames that have occurred
> >   2) what the biggest objects in a repository are and what names
> >      they appear under.
> >
> > The first bit of information can be gathered by just passing -M to
> > fast-export.  The second piece of information can partially be gotten
> > from running
> >     git cat-file --batch-check --batch-all-objects
> > However, that only shows what the biggest objects in the repository are
> > and their sizes, not what names those objects appear as or what commits
> > they were introduced in.  We can get that information from fast-export,
> > but when we only see
> >     R oldname newname
> > instead of
> >     R oldname newname
> >     M 100644 $SHA1 newname
> > then it makes the job more difficult.  Add an option which allows us to
> > force the latter output even when commits have exact renames of files.
>
> fast-export seems like a funny tool to look up paths. What about "git
> log --find-object=$SHA1" ?

Eek, and give me O(N*M) behavior, where N is the number of commits in
the repository and M is the number of renames that occur in its
history?  Also, that's the inverse of the lookup I need anyway (I have
the commit and filename, but am missing the SHA).

One of the problems with filter-branch that people often run into is
they know what they want at a high-level (e.g. extract the history of
this directory for a new repository, or rewrite the history of this
repo to appear at a subdirectory so it can be merged into a bigger
repo and people passing filenames to log will still get the history of
those files, or I want to remove some of the big stuff in my history),
but often times that's not quite enough.  They need help finding big
objects, or may be unaware that the subset of files they want used to
be known by alternative names.

I want a simple --analyze mode that can report on all files that have
been renamed (so users don't just say "all I care about is these N
files, give me a rewritten history just including those" -- we can
point out to them whether those N files used to be known by other
names), as well as reporting on all big files and if they've been
deleted, and aggregations of the "big files" information across
directories and file extensions.