Re: [PATCH] sha1-file: remove OBJECT_INFO_SKIP_CACHED

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Mon, 30 Dec 2019 16:39:03 -0800

> Ooh, I think there's something subtle hiding in this paragraph.
> 
> When I first read it, I thought you meant that the repositories are
> not self-contained --- that they contain references to the empty tree
> but do not fulfill "want"s for them.  I don't believe that's what you
> mean, though.
> 
> My second reading is the repository genuinely doesn't contain the
> empty tree but different Git server implementations handle that
> differently.  I tried to reproduce this with
> 
> 	empty_tree=$(git mktree </dev/null)
> 	git init --bare x
> 	git clone --filter=blob:none file://$(pwd)/x y
> 	cd y
> 	echo hi >README
> 	git add README
> 	git commit -m 'nonempty tree'
> 	GIT_TRACE=1 git diff-tree "$empty_tree" HEAD
> 
> and indeed, it looks like Git serves the empty tree even from
> repositories that don't contain it.  By comparison, JGit does not do
> the same trick.  So we don't need to refer to a specific repository in
> the wild to reproduce this.
> 
> All that said, having to fetch this object in the first place is
> unexpected.  The question of the promisor remote serving it is only
> relevant from the point of view of "why didn't we discover this
> sooner?"

Yes, that's true. I'll reword it to emphasize the spurious fetching (and
mention some implementations like JGit not serving those, which
exacerbates the problem).

I think we didn't discover this sooner because not many people directly
enter the empty tree on the command line :-) (but this is a problem that
we should solve, most certainly).

> > There are 2 functions: repo_has_object_file() which does not consult
> > find_cached_object() (which, among other things, knows about the empty
> > tree); and repo_read_object_file() which does.
> 
> Hm, where does this dichotomy come from?  E.g. is the latter a
> lower-level function used by the former?

I don't know the reason for the dichotomy - perhaps it was an oversight.
The relevant commits are:

d66b37bb19 ("Add pretend_sha1_file() interface.", 2007-02-05)
Adds a way to pretend that some objects are present by including them in
a cache - empty-tree-pervasiveness is built on top of this later. Only
read_sha1_file() was updated to make use of this; has_sha1_file() and
sha1_object_info() were already present at this time, but were not
updated. (Maybe the latter 2 had no need for pretending?)

346245a1bb ("hard-code the empty tree object", 2008-02-13)
Empty tree pervasiveness built on top of the cache.

c4d9986f5f ("sha1_object_info: examine cached_object store too",
2011-02-07)
sha1_object_info() was updated to use the cache. has_sha1_file() is
never mentioned.

> > as an optimization to avoid reading blobs into memory,
> > parse_object() calls repo_has_object_file() before
> > repo_read_object_file(). In the case of a regular repository (that is,
> > not a partial clone), repo_has_object_file() will return false for the
> > empty tree (thus bypassing the optimization) and repo_read_object_file()
> > will nevertheless succeed, thus things coincidentally work.
> 
> This might be easier to understand if phrased in terms of the
> intention behind the code instead of the specific call stacks used.
> See f06ab027 for an example of this kind of thing.  For example:
> 
>   Applications within and outside of Git benefit from being able to
>   assume that every repository contains the empty tree as an object
>   (see, for example, commit 9abd46a347 "Make sure the empty tree
>   exists when needed in merge-recursive", 2006-12-07).  To this end,
>   since 346245a1bb (hard-code the empty tree object, 2008-02-13), Git
>   has made the empty tree available in all repositories via
>   find_cached_object, which all object access paths can use.
> 
>   Object existence checks (has_object_file), however, do not use
>   find_cached_object.  <describe reason here>
> 
> When I state it this way, it's not obvious why this particular caller
> of has_object_file is more relevant than others.  Did I miss some
> subtlety?

This particular caller is what caused us to notice this problem.

> Indeed.  Can we justify the change even more simply in those terms?
> 
>   Object existence checks (has_object_file), however, do not use
>   find_cached_object.  <describe reason here>
> 
>   This makes the API unnecessarily difficult to reason about.
>   Instead, let's consistently view the empty tree as a genuine part of
>   the repository, even in has_object_file.  As a side effect, this
>   allows us to simplify the common 'has_object_file ||
>   find_cached_object' pattern to a more simple existence check.

OK, let me see if I can rewrite it to emphasize the reasoning-about-API
part. I still want to include the user-facing part that caused us to
observe this problem, but I can deemphasize it.

This might be a moot point, but what do you mean by the
"'has_object_file || find_cached_object' pattern"?