On Tue, Jan 28, 2020 at 01:35:08PM -0800, Jonathan Tan wrote: > Running "git status" on a partial clone that was cloned with > "--no-checkout", like this: > > git clone --no-checkout --filter=blob:none <url> foo > git -C foo status > > results in an unnecessary lazy fetch. This is because of commit > 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08), which > optimized "diff" by prefetching, but inadvertently affected at least one > command ("status") that does not need prefetching. These 2 cases can be > distinguished by looking at output_format; the former uses > DIFF_FORMAT_PATCH and the latter uses DIFF_FORMAT_CALLBACK. Sometimes "status" does need the actual blobs, if it's going to do inexact rename detection. Likewise if break-detection is turned on, though I don't think anything does it by default, and there's (I don't think) any config option to enable it. So something like "git log --name-status -B -M" would probably regress from this (though only in speed, of course; I do think we can play a little loose with heuristics since we'd generate the correct answer either way). You could get pretty specific by putting logic inside diffcore_rename(), which would know if anything is left over after exact rename detection, but I suspect just checking: (options->break_opt != -1 || options->detect_rename) in diffcore_std() would be OK in practice. > (Note that other callers that use DIFF_FORMAT_CALLBACK will also lose > prefetching. I haven't investigated enough to see if this is a net > benefit or drawback, but I think this will need to be done on a > caller-by-caller basis and can be done in the future.) We can't know what the caller is going to do with a DIFF_FORMAT_CALLBACK, so I think it makes sense to avoid pre-fetching in that case. It might be nice, though, if that pre-fetch loop in diffcore_std() were pulled into a helper function, so they could just call: diffcore_prefetch(&options); or something. I'm OK to wait on that until one of the FORMAT_CALLBACK users needs it, though. > Points of discussion I can think of: > > - Is the whitelist of output_format constants the best? I think it could be pared down a bit. For example, --raw doesn't need the blobs (aside from renames, etc, above). I think the same is true of --summary. You've already omitted --name-status and --name-only, which makes sense. I think --dirstat, even though it is only showing per-file info, still relies on the line-level stat info. So it should stay. > - Should we just have the callers pass a prefetch boolean arg instead > of trying to guess it ourselves? I'm leaning towards no since I think > we should avoid options unless they are necessary. If we can avoid it, I think we should. It makes sense to me to try to tweak the heuristics as much as possible in this central code. If there's a case that does the wrong thing, then we can decide whether the heuristics can be improved, or if we need more information from the caller. -Peff