Corner case involving null sha1, alternates, cache misses, and submodule config API

Mike Hommey <mh@xxxxxxxxxxxx> · Sun, 25 Dec 2016 11:29:04 +0900

Hi,

As you might be aware, I'm working on a mercurial remote helper for git.
The way it stores metadata for mercurial manifests abuses "commit"
references in trees, which are normally used for submodules.

Some operations in the helper use git diff-tree on those trees to find
files faster than just using ls-tree on every commit would.

Anyways, long story short, it turns out that a combination of
everything mentioned in the subject of this email causes running git
diff-tree -r --stdin with a list of 300k+ pairs of commits to take 10
minutes, when (after investigation) adding --ignore-submodules=dirty
made it take 1 minute instead, for the exact same 3GB output.

It turns out, this all starts in is_submodule_ignored(), which contains:

        if (!DIFF_OPT_TST(options, OVERRIDE_SUBMODULE_CONFIG))
                set_diffopt_flags_from_submodule_config(options, path);

And set_diffopt_flags_from_submodule_config calls:

        submodule_from_path(null_sha1, path);

And because there is no actual submodule involved, at some point that
null_sha1 ends up in the call to read_sha1_file from
submodule-config.c's config_from, which then proceeds to try to open the
null sha1 as a loose object in every alternate, doing multiple system
calls in each directory for something that is bound to fail. And to add
pain to injury, it repeats that for each and every line of input to git
diff-tree because the object cache doesn't care about storing negatives
(which makes perfect sense for most cases).

Even worse, when read_object returns NULL because the object doesn't
exist, read_sha1_file_extended calls has_loose_object which does
another set of system calls.

Now, while I realize my use case is very atypical, and that I should
just use --ignore-submodule=dirty, the fact that using the null sha1 can
trigger such behavior strikes me as a footgun that would be better
avoided. Especially when you factor the fact that
read_sha1_file_extended calls lookup_replace_object_extended, which
suggests one might interfere by creating a replace object for the null
sha1. (BTW, it's not entirely clear to me, in the context of actual
submodules, what the various --ignore-submodule options are supposed to
mean for trees that are not the current HEAD ; also, the manual page say
"all" is the default, but that doesn't appear to be true)

>From a cursory look at the output of `git grep \\bnull_sha1` it doesn't
look like the null sha1 is used anywhere else in a similar fashion where
it can be attempted to be read as an object. So, one could consider this
is something the submodule config code should handle on its own by
treating the null_sha1 argument to submodule_from_path (really
config_from) specially. After all, gitmodule_sha1_from_commit already
avoids a get_sha1() call when it's given the null sha1.

OTOH, it seems submodule_from_path and submodule_from_name, the only two
public functions that end up in config_from(), are *always* called with
either the null sha1 or a literal null pointer. The *only* calls to
these functions that doesn't involve a null sha1 or a null pointer is
from test code. So all in all, I'm not entirely sure what this sha1
argument is all about in the first place.

However, an argument could be made that null_sha1 should be treated
specially at a lower level (read_sha1_file, I guess).

What would be sensible to do here?

Mike