On 25/01/2023 06.11, Elijah Newren wrote:
It looks like Ævar and Victoria have both given really good reviews
already, but I think I spotted some additional things to comment on.
On Mon, Jan 23, 2023 at 3:46 AM William Sprent via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
From: William Sprent <williams@xxxxxxxxxxx>
There is currently no way to ask git the question "which files would be
part of a sparse checkout of commit X with sparse checkout patterns Y".
One use-case would be that tooling may want know whether sparse checkouts
of two commits contain the same content even if the full trees differ.
Could you say more about this usecase? Why does tooling need or want
to know this; won't a checkout of the new commit end up being quick
and simple? (I'm not saying your usecase is bad, just curious that it
had never occurred to me, and I'm afraid I'm still not sure what your
purpose might be.)
I'm thinking mainly about a monorepo context where there are a number of
distinct 'units' that can be described with sparse checkout patterns.
And perhaps there's some tooling that only wants to perform an action if
the content of a 'unit' changes.
Depending on the repo, it won't necessarily be quick to check out the
commit with the given patterns. However, it is more about it being
inconvenient to have to have a working directory, especially so if you
want use the tooling in some kind of service or query rapidly about
different revisions/patterns.
Another interesting use-case would be for tooling to use in conjunction
with 'git update-index --index-info'.
Sorry, I'm not following. Could you expound on this a bit?
I was imagining something along the lines of being able to generate new
tree objects based on what matches the given sparse checkout patterns.
Not that I have a specific use case for it right now.
I think what I'm trying to evoke with that paragraph is that this
enables integrations with git that seem interesting and weren't possible
before.
'rev-list --objects --filter=sparse:oid' comes close, but as rev-list is
concerned with objects rather than directory trees, it leaves files out
when the same blob occurs in at two different paths.
s/in at/at/ ?
>> It is possible to ask git about the sparse status of files currently in
the index with 'ls-files -t'. However, this does not work well when the
caller is interested in another commit, intererested in sparsity
s/intererested/interested/
patterns that aren't currently in '.git/info/sparse-checkout', or when
working in with bare repo.
s/in with bare/with a bare/ or s/in with bare/in a bare/?
Ah. Thanks for catching those.
To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
which takes the object id of a blob containing sparse checkout patterns
that filters the output of 'ls-tree'. When filtering with given sparsity
patterns, 'ls-tree' only outputs blobs and commit objects that
match the given patterns.
This seems slightly unfortunate in that it makes things difficult for
cone mode users to take advantage of. They will have to figure out
how to translate their directory list into sparse checkout patterns
before passing it along, and currently the only way to do that is via
`git sparse-checkout set <patterns>` and reading the patterns from
$GIT_DIR/info/sparse-checkout, but that toggles the sparsity of the
current working tree and avoiding changing the current sparse-checkout
was something you listed in your commit message as something you
wanted to avoid.
>> While it may be valid in some situations to output a tree object -- e.g.
when a cone pattern matches all blobs recursively contained in a tree --
it is less unclear what should be output if a sparse pattern matches
parts of a tree.
To allow for reusing the pattern matching logic found in
'path_in_sparse_checkout_1()' in 'dir.c' with arbitrary patterns,
extract the pattern matching part of the function into its own new
function 'recursively_match_path_with_sparse_patterns()'.
Signed-off-by: William Sprent <williams@xxxxxxxxxxx>
---
ls-tree: add --sparse-filter-oid argument
I'm resubmitting this change as rebased on top of 'master', as it
conflicted with the topic 'ls-tree.c: clean-up works' 1
[https://public-inbox.org/git/20230112091135.20050-1-tenglong.tl@xxxxxxxxxxxxxxx],
which was merged to 'master' recently.
This versions also incorporates changes based on the comments made in 2
[https://public-inbox.org/git/CAPig+cRgZ0CrkqY7mufuWmhf6BC8yXjXXuOTEQjuz+Y0NA+N7Q@xxxxxxxxxxxxxx/].
I'm also looping in contributors that have touched ls-tree and/or
sparse-checkouts recently. I hope that's okay.
Of course! It's encouraged, even.
[...]
diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
[...]
+static void init_sparse_filter_data(struct sparse_filter_data **d, struct ls_tree_options *options,
+ const char *sparse_oid_name, read_tree_fn_t fn)
+{
+ struct object_id sparse_oid;
+ struct object_context oc;
+
+ (*d) = xcalloc(1, sizeof(**d));
+ (*d)->fn = fn;
+ (*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
Hmm, so the behavior still depends upon the current sparse-checkout
(or lack thereof), despite the documentation and rationale of your
feature as being there to check how a different sparse checkout would
behave?
I would hate to unconditionally turn cone_patterns off, since that
would come with a huge performance penalty for the biggest repos. But
turning it unconditionally on wouldn't be good for the non-cone users.
This probably suggests we need something like another flag, or perhaps
separate flags for each mode. Separate flags might provide the
benefit of allowing cone mode users to specify directories rather than
patterns, which would make it much easier for them to use.
I used 'core_sparse_checkout_cone' because I wanted to allow for the
cone mode optimisations, but I also figured that I should respect the
configuration. It doesn't change how the patterns are parsed in this case.
I agree that it is a bit awkward to have to "translate" the directories
into patterns when wanting to use cone mode. I can try adding
'--[no]-cone' flags and see how that feels. Together with Victoria's
suggestions that would result in having the following flags:
* --scope=(sparse|all)
* --sparse-patterns-file=<path>
* --[no]-cone: used together with --sparse-patterns-file to tell git
whether to interpret the patterns given as directories (cone) or
patterns (no-cone).
Which seems like a lot at first glance. But it allows for passing
directories instead of patterns for cone mode, and is similar to the
behaviour of 'sparse-checkout set'.
Does that seem like something that would make sense?
[...]
+static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
+{
+ enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
+ return match > 0;
So your new caller doesn't care about the pattern_match_result, it
just wants to know if it got MATCHED or MATCHED_RECURSIVELY...
[...]
diff --git a/dir.c b/dir.c
index 4e99f0c868f..122ebced08e 100644
--- a/dir.c
+++ b/dir.c
@@ -1457,45 +1457,50 @@ int init_sparse_checkout_patterns(struct index_state *istate)
return 0;
}
-static int path_in_sparse_checkout_1(const char *path,
- struct index_state *istate,
- int require_cone_mode)
+int recursively_match_path_with_sparse_patterns(const char *path,
You claim it returns an int here, but previously you presumed an enum
pattern_match_result from the new caller.
+ struct index_state *istate,
+ int dtype,
+ struct pattern_list *pl)
{
- int dtype = DT_REG;
enum pattern_match_result match = UNDECIDED;
const char *end, *slash;
-
- /*
- * We default to accepting a path if the path is empty, there are no
- * patterns, or the patterns are of the wrong type.
- */
- if (!*path ||
- init_sparse_checkout_patterns(istate) ||
- (require_cone_mode &&
- !istate->sparse_checkout_patterns->use_cone_patterns))
- return 1;
-
/*
* If UNDECIDED, use the match from the parent dir (recursively), or
* fall back to NOT_MATCHED at the topmost level. Note that cone mode
* never returns UNDECIDED, so we will execute only one iteration in
* this case.
*/
- for (end = path + strlen(path);
- end > path && match == UNDECIDED;
+ for (end = path + strlen(path); end > path && match == UNDECIDED;
end = slash) {
-
for (slash = end - 1; slash > path && *slash != '/'; slash--)
; /* do nothing */
match = path_matches_pattern_list(path, end - path,
slash > path ? slash + 1 : path, &dtype,
- istate->sparse_checkout_patterns, istate);
+ pl, istate);
/* We are going to match the parent dir now */
dtype = DT_DIR;
}
- return match > 0;
+
+ return match;
Um, this last line seems like a potentially scary change in behavior.
Why should UNDECIDED return a non-zero value? Previously, we returned
a 0 value for both UNDECIDED and NOT_MATCHED, but you've changed that
here. If the change in this last line is actually correct, it should
be split out into its own commit and explained in detail in the commit
message.
+}
+
+static int path_in_sparse_checkout_1(const char *path,
+ struct index_state *istate,
+ int require_cone_mode)
+{
+ /*
+ * We default to accepting a path if the path is empty, there are no
+ * patterns, or the patterns are of the wrong type.
+ */
+ if (!*path ||
+ init_sparse_checkout_patterns(istate) ||
+ (require_cone_mode &&
+ !istate->sparse_checkout_patterns->use_cone_patterns))
+ return 1;
+
+ return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;
Oh, you compare to > 0 here...and digging around your only other
caller just immediately compares to > 0 as well.
Why not just have recursively_match_path_with_sparse_patterns() do the
0 check? If it does, returning int is fine. If it doesn't, it
should be declared to return enum pattern_match_result.
I think my thinking was that it made sense in general for
'recursively_match_path_with_sparse_patterns()' to expose the match
result, but then failed to declare it that way. Thanks for catching it.