Re: Corner case involving null sha1, alternates, cache misses, and submodule config API

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Dec 24, 2016 at 6:29 PM, Mike Hommey <mh@xxxxxxxxxxxx> wrote:
> Hi,
>
> As you might be aware, I'm working on a mercurial remote helper for git.
> The way it stores metadata for mercurial manifests abuses "commit"
> references in trees, which are normally used for submodules.
>
> Some operations in the helper use git diff-tree on those trees to find
> files faster than just using ls-tree on every commit would.
>
> Anyways, long story short, it turns out that a combination of
> everything mentioned in the subject of this email causes running git
> diff-tree -r --stdin with a list of 300k+ pairs of commits to take 10
> minutes, when (after investigation) adding --ignore-submodules=dirty
> made it take 1 minute instead, for the exact same 3GB output.
>
> It turns out, this all starts in is_submodule_ignored(), which contains:
>
>         if (!DIFF_OPT_TST(options, OVERRIDE_SUBMODULE_CONFIG))
>                 set_diffopt_flags_from_submodule_config(options, path);
>
> And set_diffopt_flags_from_submodule_config calls:
>
>         submodule_from_path(null_sha1, path);
>
> And because there is no actual submodule involved, at some point that
> null_sha1 ends up in the call to read_sha1_file from
> submodule-config.c's config_from, which then proceeds to try to open the
> null sha1 as a loose object in every alternate, doing multiple system
> calls in each directory for something that is bound to fail. And to add
> pain to injury, it repeats that for each and every line of input to git
> diff-tree because the object cache doesn't care about storing negatives
> (which makes perfect sense for most cases).
>
> Even worse, when read_object returns NULL because the object doesn't
> exist, read_sha1_file_extended calls has_loose_object which does
> another set of system calls.
>
> Now, while I realize my use case is very atypical, and that I should
> just use --ignore-submodule=dirty, the fact that using the null sha1 can
> trigger such behavior strikes me as a footgun that would be better
> avoided. Especially when you factor the fact that
> read_sha1_file_extended calls lookup_replace_object_extended, which
> suggests one might interfere by creating a replace object for the null
> sha1. (BTW, it's not entirely clear to me, in the context of actual
> submodules, what the various --ignore-submodule options are supposed to
> mean for trees that are not the current HEAD ; also, the manual page say
> "all" is the default, but that doesn't appear to be true)

I think at this high level we should start optimizing submodules, i.e.
not HEAD / working tree -> select less aggressive submodule config.

>
> From a cursory look at the output of `git grep \\bnull_sha1` it doesn't
> look like the null sha1 is used anywhere else in a similar fashion where
> it can be attempted to be read as an object. So, one could consider this
> is something the submodule config code should handle on its own by
> treating the null_sha1 argument to submodule_from_path (really
> config_from) specially. After all, gitmodule_sha1_from_commit already
> avoids a get_sha1() call when it's given the null sha1.
>
> OTOH, it seems submodule_from_path and submodule_from_name, the only two
> public functions that end up in config_from(), are *always* called with
> either the null sha1 or a literal null pointer.

When Heiko introduced the submodule config, the consensus
was to have a flexible API to lookup submodule configuration.
The sha1 argument defines where to lookup the submodule
config (e.g. you could have moved a submodule such that the
submodule_from_path returns a different submodule for HEAD^
than HEAD.

The null_sha1 is used for current working tree configuration, i.e.
load .gitmodules and then overload with .git/config.

origin/bw/grep-submodules changes the behavior as it preloads the
null_sha1 to be an actual commit-ish (HEAD^').

I may make use of it in the checkout --recurse-submodules series.

> The *only* calls to
> these functions that doesn't involve a null sha1 or a null pointer is
> from test code. So all in all, I'm not entirely sure what this sha1
> argument is all about in the first place.

See Documentation/technical/api-submodule-config.txt (IIRC)
in next(?).

>
> However, an argument could be made that null_sha1 should be treated
> specially at a lower level (read_sha1_file, I guess).
>
> What would be sensible to do here?

As said above, the nul_sha1 is (ab-)used to mean
"use the preloaded config" which as of now is .git/config on top
of .gitmodules in the working tree.

I would assume we want another earlier and cheaper switch
for all these functions. This may also come in handy for the
git-series, that uses gitlinks as non-submodules.

Something like

    int treat_gitlinks_as_submodule()

which can be configured to be "auto"[1]  and then
"on", "off" that can be set for tools such as mercurial
remote helper or git-series.

[1] the default that
requires some slightly more expensive guess work, e.g.
the existence of the .gitmodules file. c.f.
https://github.com/git/git/commit/1863e05af5be73d9f7b9a1d22a33ca9849726623
static void preset_submodule_default(void)
would be a starting point to cheaply estimate
if gitlinks are submodules.



>
> Mike



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]