On Sat, Nov 10, 2018 at 11:23 PM Jeff King <peff@xxxxxxxx> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote:
>
> > fast-export output is traditionally used as an input to a fast-import
> > program, but it is also useful to help gather statistics about the
> > history of a repository (particularly when --no-data is also passed).
> > For example, two of the types of information we may want to collect
> > could include:
> >   1) general information about renames that have occurred
> >   2) what the biggest objects in a repository are and what names
> >      they appear under.
> >
> > The first bit of information can be gathered by just passing -M to
> > fast-export.  The second piece of information can partially be gotten
> > from running
> >     git cat-file --batch-check --batch-all-objects
> > However, that only shows what the biggest objects in the repository are
> > and their sizes, not what names those objects appear as or what commits
> > they were introduced in.  We can get that information from fast-export,
> > but when we only see
> >     R oldname newname
> > instead of
> >     R oldname newname
> >     M 100644 $SHA1 newname
> > then it makes the job more difficult.  Add an option which allows us to
> > force the latter output even when commits have exact renames of files.
>
> fast-export seems like a funny tool to look up paths. What about "git
> log --find-object=$SHA1" ?

Eek, and give me O(N*M) behavior, where N is the number of commits in
the repository and M is the number of renames that occur in its
history?  Also, that's the inverse of the lookup I need anyway (I have
the commit and filename, but am missing the SHA).

One of the problems with filter-branch that people often run into is
they know what they want at a high-level (e.g.
extract the history of this directory for a new repository, or rewrite
the history of this repo to appear at a subdirectory so it can be
merged into a bigger repo and people passing filenames to log will
still get the history of those files, or remove some of the big stuff
in the history), but often that's not quite enough.  They need help
finding big objects, or may be unaware that the subset of files they
want used to be known by other names.

I want a simple --analyze mode that can report on all files that have
been renamed (so users don't just say "all I care about is these N
files, give me a rewritten history just including those" -- we can
point out to them whether those N files used to be known by other
names), as well as reporting on all big files and whether they've been
deleted, and aggregations of the "big files" information across
directories and file extensions.
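
To make the shape of that scan concrete, here is a rough sketch in
Python of reading a fast-export stream once and collecting renames plus
the names each blob appears under.  The function name analyze_stream is
made up for illustration, and the sketch assumes unquoted paths and
rename sources without spaces; it is not how any existing tool does it.

```python
from collections import defaultdict


def analyze_stream(stream):
    """Scan a fast-export stream (a binary file object) in a single pass.

    Returns (renames, names_by_blob): a list of (old, new) path pairs and
    a dict mapping each blob id to the set of paths it was seen under.
    Hypothetical helper; assumes paths are not C-style quoted.
    """
    renames = []
    names_by_blob = defaultdict(set)
    while True:
        line = stream.readline()
        if not line:
            break
        line = line.rstrip(b"\n")
        if line.startswith(b"data "):
            # Skip the payload (e.g. a commit message) by its byte count so
            # its text is never mistaken for a filechange command.
            stream.read(int(line[5:]))
        elif line.startswith(b"M "):
            # "M <mode> <dataref> <path>"; the path is last, so it may
            # contain spaces.
            _, _mode, dataref, path = line.split(b" ", 3)
            names_by_blob[dataref.decode()].add(path.decode())
        elif line.startswith(b"R "):
            # "R <old> <new>"; assumes the old path has no spaces.
            _, old, new = line.split(b" ", 2)
            renames.append((old.decode(), new.decode()))
    return renames, dict(names_by_blob)
```

Something like "git fast-export --no-data -M --all" piped into a script
that calls analyze_stream(sys.stdin.buffer) would feed it.  Note that
the blob for a renamed file's new name only shows up in names_by_blob
when an M line follows the R line, which is the case the proposed
option is meant to guarantee.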