Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename

Jeff King <peff@xxxxxxxx> · Tue, 13 Nov 2018 09:45:54 -0500

On Mon, Nov 12, 2018 at 10:08:10AM -0800, Elijah Newren wrote:

> > I would do:
> >
> >    git log --raw $(
> >      git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
> >      sort -rn | head -3 |
> >      awk '{print "--find-object=" $2 }'
> >    )
> >
> > I'm not sure how renames enter into it at all.
> 
> How did I miss objectsize:disk??  Especially since it is right next to
> objectsize in the manpage to boot?  That's awesome, thanks for that
> pointer.
> 
> I do have a separate cat-file --batch-check --batch-all-objects
> process already, since I can't get sizes out of either log or
> fast-export.  However, I wouldn't use your 'head -3' since I'm not
> looking for the N biggest, but reporting on _all_ objects (in reverse
> size order) and letting the user look over the report and deciding
> where to stop reading.  So, this is a big and expensive log command.
> Granted, we will need a big and expensive log command, but let's keep
> in mind that we have this one.

It is an expensive log command, but it's the same expense as running
fast-export, no? And I think maybe that is the disconnect.

I am looking at this problem as "how do you answer question X in a
repository". And I think you are looking at as "I am receiving a
fast-export stream, and I need to answer question X on the fly".

And that would explain why you want to get extra annotations into the
fast-export stream. Is that right?

> > There I think you'd want to assemble the list with something like "git
> > log --follow --name-only paths-of-interest" except that --follow sucks
> > too much to handle more than one path at a time.
> >
> > But if you wanted to do it manually, then:
> >
> >   git log --diff-filter=R --name-only
> >
> > would be enough to let you track it down, wouldn't it?
> 
> Without a -M you'd only catch 100% renames, right?  Those aren't the
> only ones I'd want to catch, so I'd need to add -M.  You are right
> that we could get basic renames this way, but it doesn't cover
> everything I need.  Let's use this as a starting point, though, and
> build up to what I need...

No, renames are on by default these days, and that includes inexact
renames. That said, if you're scripting you probably ought to be doing:

  git rev-list HEAD | git diff-tree --stdin

and there yes, you'd have to enable "-M" yourself (you touched on
scripting and formatting below; diff-tree can accept the format options
you'd want).

> I also want to know when files were deleted.  I've generally found
> that people are more okay with purging parts of history [corresponding
> to large ojbects] that were deleted longer ago than more recent stuff,
> for a variety of reasons.  So we could either run yet another log, or
> modify the command to:
> 
>   git log -M --diff-filter=RD --name-status
> 
> However, I don't just want to know when files were deleted, I'd like
> to know when directories are deleted.  I only knew how to derive that
> from knowing what files existed within those directories, so that
> would take me to:
> 
>   git log -M --diff-filter=RAD --name-status
> 
> [Edit: I just saw your other email and for the first time learned
> about the -t rev-list option which might simplify this a little,
> although "need to worry about deleted files being reinstated" below
> might require the 'A' anyway.]

Yeah, I think "-t" would help your tree deletion problem.

> At this point, let's remember that we had another full git-log
> invocation for mapping object sizes to filenames.  We might as well
> coalesce the two log commands into one, by extending this latest one
> to:
> 
>   git log -M --diff-filter=RAMD --no-abbrev --raw

What is there besides RAMD? :)

> I could potentially switch to using this and drop patch 10/10.

So I'm still not _entirely_ clear on what you're trying to do with
10/10. I think maybe the "disconnect" part I wrote above explains it. If
that's correct, then I think framing it in terms of the operations that
you'd be able to perform _without running a separate traverse_ would
make it more obvious.

> Anyway, I hope it makes a little more sense why I created this patch.
> Does it, or have I just made things even more confusing?

Some of both, I think.

> ...and if you've read this far, I'm impressed.  Thanks for reading.

I'll admit I skimmed near the end. ;)

-Peff