On Tue, Nov 13, 2018 at 6:45 AM Jeff King <peff@xxxxxxxx> wrote:
> It is an expensive log command, but it's the same expense as running
> fast-export, no? And I think maybe that is the disconnect.

I would expect an expensive log command to generally be the same
expense as running fast-export, yes. But I would expect two expensive
log commands to be twice the expense of a single fast-export (and you
suggested two log commands: both the --find-object= one and the
--diff-filter one).

> I am looking at this problem as "how do you answer question X in a
> repository". And I think you are looking at as "I am receiving a
> fast-export stream, and I need to answer question X on the fly".
>
> And that would explain why you want to get extra annotations into the
> fast-export stream. Is that right?

I'm not trying to get information on the fly during a rewrite or
anything like that. This is an optional pre-rewrite step (from a
separate invocation of the tool) where I have multiple questions I
want to answer. I'd like to answer them all relatively quickly, if
possible, and I think all of them should be answerable with a single
history traversal (plus a cat-file --batch-all-objects call to get
object sizes, since I don't know of another way to get those). I'd be
fine with switching from fast-export to log or something else if it
met the needs better.

As far as I can tell, you're trying to split each question apart and
do a history traversal for each, and I don't see why that's better.
Simpler, perhaps, but it seems worse for performance. Am I missing
something?

> > > There I think you'd want to assemble the list with something like "git
> > > log --follow --name-only paths-of-interest" except that --follow sucks
> > > too much to handle more than one path at a time.
> > >
> > > But if you wanted to do it manually, then:
> > >
> > >   git log --diff-filter=R --name-only
> > >
> > > would be enough to let you track it down, wouldn't it?
> >
> > Without a -M you'd only catch 100% renames, right? Those aren't the
> > only ones I'd want to catch, so I'd need to add -M. You are right
> > that we could get basic renames this way, but it doesn't cover
> > everything I need. Let's use this as a starting point, though, and
> > build up to what I need...
>
> No, renames are on by default these days, and that includes inexact
> renames. That said, if you're scripting you probably ought to be doing:
>
>   git rev-list HEAD | git diff-tree --stdin
>
> and there yes, you'd have to enable "-M" yourself (you touched on
> scripting and formatting below; diff-tree can accept the format options
> you'd want).

Ah, I didn't know renames were on by default; I somehow missed that.
Also, the rev-list to diff-tree pipe is nice, but I also need parent
and commit timestamp information.

....

> Yeah, I think "-t" would help your tree deletion problem.

Absolutely, thanks for the hint. Much appreciated. :-)

> > At this point, let's remember that we had another full git-log
> > invocation for mapping object sizes to filenames. We might as well
> > coalesce the two log commands into one, by extending this latest one
> > to:
> >
> >   git log -M --diff-filter=RAMD --no-abbrev --raw
>
> What is there besides RAMD? :)

Well, as you pointed out above, log detects renames by default,
whereas it didn't use to. So, if someone had written some similar-ish
history walking/parsing tool years ago that didn't need renames and
was based on log output, there's a good chance their tool might have
started failing when rename detection was turned on by default,
because instead of getting both a 'D' and an 'A' change, they'd get
an unexpected 'R'.

For my case, do I have to worry about similar future changes? Will
copy detection ('C') or break detection ('B') become the default in
the future? Do I have to worry about typechanges ('T')? Will new
change types be added?
I mean, the fast-export output could maybe change too, but it seems
much less likely than with log.

> > I could potentially switch to using this and drop patch 10/10.
>
> So I'm still not _entirely_ clear on what you're trying to do with
> 10/10. I think maybe the "disconnect" part I wrote above explains it. If
> that's correct, then I think framing it in terms of the operations that
> you'd be able to perform _without running a separate traverse_ would
> make it more obvious.

Let me try to put it as briefly as I can. With as few traversals as
possible, I want to:
  * Get all blob sizes
  * Map blob shas to filename(s) they appeared under in the history
  * Find when files and directories were deleted (and whether they
    were later reinstated, since that means they aren't actually gone)
  * Find sets of filenames referring to the same logical 'file'. (e.g.
    foo->bar in commit A and bar->baz in commit B mean that
    {foo,bar,baz} refer to the same 'file' so that a user has an easy
    report to look at to find out that if they just want to "keep baz
    and its history" then they need foo & bar & baz. I need to know
    about things like another foo or bar being introduced after the
    rename though, since that breaks the connection between filenames)
  * Do a few aggregations on the above data as well (e.g. all copies
    of postgres.exe add up to 20M -- why were those checked in
    anyway?, *.webm files in aggregate are .5G, your long-deleted
    src/video-server/ directory from that aborted experimental project
    years ago takes up 2G of your history, etc.)

Right now, my best solution for this combination of questions is
'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10
in place. I'm totally open to better solutions, including ones that
don't use fast-export.
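
To make the first, second and fourth bullets above a bit more
concrete, here is a rough Python sketch of the bookkeeping involved.
It is not the fast-export based approach from the patch series; it
assumes the input is the text produced by
'git cat-file --batch-all-objects --batch-check' and by the
'git log -M --diff-filter=RAMD --no-abbrev --raw' command quoted
earlier, with paths that need no quoting. The names parse_batch_check,
parse_raw_lines and summarize are made up for illustration, and the
sketch deliberately ignores deletions and the case where a name is
reintroduced after a rename.

```python
from collections import defaultdict


def parse_batch_check(text):
    """Map blob sha -> size from '<sha> <type> <size>' lines."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "blob":
            sizes[parts[0]] = int(parts[2])
    return sizes


def parse_raw_lines(text):
    """Yield (status, src_sha, dst_sha, paths) for each raw diff line."""
    for line in text.splitlines():
        # Skip commit headers, blank lines and combined-diff ('::') entries.
        if not line.startswith(":") or line.startswith("::"):
            continue
        meta, *paths = line.split("\t")
        _src_mode, _dst_mode, src_sha, dst_sha, status = meta[1:].split(" ")
        # status may carry a similarity score (e.g. 'R100'); keep the letter.
        yield status[0], src_sha, dst_sha, paths


def summarize(raw_text, sizes):
    """Return (blob sha -> set of names, list of rename-linked name sets)."""
    names_by_blob = defaultdict(set)
    parent = {}  # union-find over filenames linked by renames

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for status, _src_sha, dst_sha, paths in parse_raw_lines(raw_text):
        # Only blobs are in 'sizes', so tree entries are ignored here.
        if status in "AMR" and dst_sha in sizes:
            names_by_blob[dst_sha].add(paths[-1])
        if status == "R":
            parent[find(paths[0])] = find(paths[1])

    groups = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    return names_by_blob, [g for g in groups.values() if len(g) > 1]
```

Both inputs can be collected once and fed to summarize(); the
aggregations in the last bullet would then be sums over 'sizes' keyed
by the names in names_by_blob.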