> Ooh, I think there's something subtle hiding in this paragraph. > > When I first read it, I thought you meant that the repositories are > not self-contained --- that they contain references to the empty tree > but do not fulfill "want"s for them. I don't believe that's what you > mean, though. > > My second reading is the repository genuinely doesn't contain the > empty tree but different Git server implementations handle that > differently. I tried to reproduce this with > > empty_tree=$(git mktree </dev/null) > git init --bare x > git clone --filter=blob:none file://$(pwd)/x y > cd y > echo hi >README > git add README > git commit -m 'nonempty tree' > GIT_TRACE=1 git diff-tree "$empty_tree" HEAD > > and indeed, it looks like Git serves the empty tree even from > repositories that don't contain it. By comparison, JGit does not do > the same trick. So we don't need to refer to a specific repository in > the wild to reproduce this. > > All that said, having to fetch this object in the first place is > unexpected. The question of the promisor remote serving it is only > relevant from the point of view of "why didn't we discover this > sooner?" Yes, that's true. I'll reword it to emphasize the spurious fetching (and mention some implementations like JGit not serving those, which exacerbates the problem). I think we didn't discover this sooner because not many people directly enter the empty tree on the command line :-) (but this is a problem that we should solve, most certainly). > > There are 2 functions: repo_has_object_file() which does not consult > > find_cached_object() (which, among other things, knows about the empty > > tree); and repo_read_object_file() which does. > > Hm, where does this dichotomy come from? E.g. is the latter a > lower-level function used by the former? I don't know the reason for the dichotomy - perhaps it was an oversight. The relevant commits are: d66b37bb19 ("Add pretend_sha1_file() interface.", 2007-02-05) Adds a way to pretend that some objects are present by including them in a cache - empty-tree-pervasiveness is built on top of this later. Only read_sha1_file() was updated to make use of this; has_sha1_file() and sha1_object_info() were already present at this time, but were not updated. (Maybe the latter 2 had no need for pretending?) 346245a1bb ("hard-code the empty tree object", 2008-02-13) Empty tree pervasiveness built on top of the cache. c4d9986f5f ("sha1_object_info: examine cached_object store too", 2011-02-07) sha1_object_info() was updated to use the cache. has_sha1_file() is never mentioned. > > as an optimization to avoid reading blobs into memory, > > parse_object() calls repo_has_object_file() before > > repo_read_object_file(). In the case of a regular repository (that is, > > not a partial clone), repo_has_object_file() will return false for the > > empty tree (thus bypassing the optimization) and repo_read_object_file() > > will nevertheless succeed, thus things coincidentally work. > > This might be easier to understand if phrased in terms of the > intention behind the code instead of the specific call stacks used. > See f06ab027 for an example of this kind of thing. For example: > > Applications within and outside of Git benefit from being able to > assume that every repository contains the empty tree as an object > (see, for example, commit 9abd46a347 "Make sure the empty tree > exists when needed in merge-recursive", 2006-12-07). To this end, > since 346245a1bb (hard-code the empty tree object, 2008-02-13), Git > has made the empty tree available in all repositories via > find_cached_object, which all object access paths can use. > > Object existence checks (has_object_file), however, do not use > find_cached_object. <describe reason here> > > When I state it this way, it's not obvious why this particular caller > of has_object_file is more relevant than others. Did I miss some > subtlety? This particular caller is what caused us to notice this problem. > Indeed. Can we justify the change even more simply in those terms? > > Object existence checks (has_object_file), however, do not use > find_cached_object. <describe reason here> > > This makes the API unnecessarily difficult to reason about. > Instead, let's consistently view the empty tree as a genuine part of > the repository, even in has_object_file. As a side effect, this > allows us to simplify the common 'has_object_file || > find_cached_object' pattern to a more simple existence check. OK, let me see if I can rewrite it to emphasize the reasoning-about-API part. I still want to include the user-facing part that caused us to observe this problem, but I can deemphasize it. This might be a moot point, but what do you mean by the "'has_object_file || find_cached_object' pattern"?