Re: [PATCH] cache-tree: avoid needless promisor object fetch

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Fri, 23 Apr 2021 14:28:13 +0200 (CEST)

Hi Jonathan,

On Thu, 22 Apr 2021, Jonathan Tan wrote:

> In update_one() (used only by cache_tree_update()), there is an object
> existence check that, if it fails, will automatically trigger a lazy
> fetch in a partial clone. But the fetch is not necessary - the object is
> not actually being used.

I find it curious, though, that the `ce_missing_ok` variable is defined
thusly (sadly, the context of your diff is too small to show it):

                ce_missing_ok = mode == S_IFGITLINK || missing_ok ||
                        (has_promisor_remote() &&
                         ce_skip_worktree(ce));

Which means that the `has_object_file()` function is only called if the
entry is not marked with the `skip-worktree` bit, i.e. if it is _not_
excluded from the sparse checkout.

Wouldn't that mean that the object _should_ be there?

I guess what I am saying is that while the commit message focuses on the
"What?" of the patch, I would love to hear more about the "Why?". And
maybe the "When?" as in: when does this actually matter?

And since the bug was critical enough for you to spend time on crafting
it, maybe it would make sense to add a regression test to ensure that this
bug does not creep in again?

>
> Replace that check with two checks: an object existence check that does
> not fetch, and then a check that that object is a promisor object.

This essentially repeats what the diff says, but it might make more sense
to explain why the post-image of this diff is more correct (and maybe
discuss performance implications).

>
> Doing this avoids multiple lazy fetches when merging two trees in a
> partial clone, as noticed at $DAYJOB.

Ah. But where are those trees fetched, then?

Maybe lead with the description of the bug?

> Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
> ---
> Another alternative is to think about whether the object existence check
> here is needed in the first place.
>
> There might also be other places we can make a similar change in
> update_one(), but I limited myself to what's needed to solve the
> specific case we discovered at $DAYJOB.

I only see another `has_object_file()` call site at the very beginning,
and I think this needs to fetch. Or maybe it is more efficient to
construct the cache tree from scratch than fetch it?

There is also `cache_tree_fully_valid_1()`, where I think the same
handling could potentially make sense. (Or, if you target `seen`,
`cache_tree_fully_valid()`.

Ciao,
Johannes

> ---
>  cache-tree.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index add1f07713..6728722597 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "packfile.h"
>
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -362,7 +363,9 @@ static int update_one(struct cache_tree *it,
>  			(has_promisor_remote() &&
>  			 ce_skip_worktree(ce));
>  		if (is_null_oid(oid) ||
> -		    (!ce_missing_ok && !has_object_file(oid))) {
> +		    (!ce_missing_ok &&
> +		     !has_object_file_with_flags(oid, OBJECT_INFO_SKIP_FETCH_OBJECT) &&
> +		     !is_promisor_object(oid))) {
>  			strbuf_release(&buffer);
>  			if (expected_missing)
>  				return -1;
> --
> 2.31.1.498.g6c1eba8ee3d-goog
>
>