Re: [PATCH v4 00/10] Sparse index: delete ignored files outside sparse cone

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 24, 2021 at 2:51 PM Derrick Stolee via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
>
> UPDATE: I had to base this version on a merge of ds/add-with-sparse-index
> and v2.33.0-rc1 because of some changes to dir.h that became important!

Isn't this stale (i.e. just copied from v2/v3)?

...
> Updates in V4
> =============
>
>  * Fixed an issue with the split index.
>
>  * The helper methods are used more consistently.
>
>  * The helper method path_in_cone_mode_sparse_checkout() is introduced.
>
>  * Commit messages are edited for clarity.
>
>  * A new config option is added to disable the behavior being added in this
>    series.
>
>  * I split the commit that involves cache_tree_update(). I have not yet
>    succeeded in creating tests to demonstrate why this is required outside
>    of needing it in the Scalar functional tests, which includes a version of
>    partial clone. I will continue to investigate recreating this scenario in
>    the Git test suite, but I wanted to send this version out to get feedback
>    on the things that have changed.

Are you talking about patch 3 or patch 4?

For patch 4, as I commented on that patch, my guess would be that you
start with a full tree, pick some directory that you will sparsify
away, and before sparsifying it away, change some file under that
directory and stage the change.  Then when you sparsify the directory
away with cone mode, I think the cache_tree_update() will need to
write the tree.  At least, some quick greps of cache_tree_update
suggest there aren't that many codepaths that attempt to update it, so
I don't think anything else would have written it yet at that point.

For patch 3, since it comes before patch 5, it would be triggerable
with the same case.  Is it possible you originally found both patches
based off similar initial conditions that just resulted in slightly
different behavioral reports, and thus that patch 4 is good enough
without patch 3?  Trying to ignore that possibility and looking at the
cache_tree_update() code, the only other reasons for failure of
cache_tree_update() that I can see are (a) someone having invalid
hashes recorded in the index (e.g. someone using `git update-index
--index-info` with some invalid hashes, or someone manually deleting
some loose blobs from under their .git/ directory), or (b) disk being
full or read-only so that writing new tree objects to the object store
fail.

> base-commit: 80b8d6c56b8a5f5db1d5c2a0159fd808e8a7fc4f
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1009%2Fderrickstolee%2Fsparse-index%2Fignored-files-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1009/derrickstolee/sparse-index/ignored-files-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/1009

A super micro nit:

It appears this is based on your other branches that have since been
incorporated into master.  Not a big deal, but rebasing on master
might make it easier for others to take a look at the series (I was
confused at first why this series depended on other unmerged series,
before realizing those other series had since merged down).  I know we
normally don't like rebasing series, but I suspect it wouldn't change
the range-diff (as I don't think others are touching these areas of
the code), and thus be transparent to Junio.



I commented on some individual patches, but have some small comments
that are easier to post in response to the range-diff:

> Range-diff vs v3:
>
>   1:  e66106f7a99 !  1:  c407b2cb346 t7519: rewrite sparse index test
>      @@ t/t7519-status-fsmonitor.sh: test_expect_success 'status succeeds after staging/
>        test_expect_success 'status succeeds with sparse index' '
>       - git reset --hard &&
>       + git clone . full &&
>      -+ git clone . sparse &&
>      ++ git clone --sparse . sparse &&
>       + git -C sparse sparse-checkout init --cone --sparse-index &&
>       + git -C sparse sparse-checkout set dir1 dir2 &&
>
>   2:  fb3ff9108bf !  2:  8660877ba7a sparse-index: silently return when not using cone-mode patterns
>      @@ sparse-index.c: static int index_has_unmerged_entries(struct index_state *istate
>        {
>         int test_env;
>       - if (istate->split_index || istate->sparse_index ||
>      -+
>       + if (istate->split_index || istate->sparse_index || !istate->cache_nr ||
>             !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>                 return 0;
>   3:  37198535268 !  3:  a669740af9a sparse-index: silently return when cache tree fails
>      @@ Commit message
>           convert_to_sparse(), then we could still recover from this scenario and
>           have a sparse index.
>
>      -    When constructing the cache-tree extension in convert_to_sparse(), it is
>      -    possible that we construct a tree object that is new to the object
>      -    database. Without the WRITE_TREE_MISSING_OK flag, this results in an
>      -    error that halts our conversion to a sparse index. Add this flag to
>      -    remove this limitation.
>      -
>           Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>
>        ## sparse-index.c ##
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       + /*
>       +  * Silently return if there is a problem with the cache tree update,
>       +  * which might just be due to a conflict state in some entry.
>      -+  *
>      -+  * This might create new tree objects, so be sure to use
>      -+  * WRITE_TREE_MISSING_OK.
>       +  */
>      -+ if (cache_tree_update(istate, WRITE_TREE_MISSING_OK))
>      ++ if (cache_tree_update(istate, 0))
>       +         return 0;
>
>         remove_fsmonitor(istate);
>   -:  ----------- >  4:  b379b8fc61a sparse-index: use WRITE_TREE_MISSING_OK

Thanks for splitting patch 3 & 4.  I left some comments on patch 5.
For patch 4, my comments above about testcases might be of interest.

>   4:  10bcadb284e !  5:  acdded0f762 unpack-trees: fix nested sparse-dir search
>      @@ Commit message
>           diff that modified files within the sparse directory entry, but we could
>           not correctly find the entry.
>
>      +    Helped-by: Johannes Schindelin <Johannes.Schindelin@xxxxxx>
>      +    Helped-by: René Scharfe <l.s.r@xxxxxx>
>           Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>
>        ## unpack-trees.c ##
>      @@ unpack-trees.c: static int sparse_dir_matches_path(const struct cache_entry *ce,
>                                             const struct name_entry *p)
>        {
>       - struct cache_entry *ce;
>      ++ const char *path;
>       + struct cache_entry *ce = NULL;
>         int pos = find_cache_pos(info, p->path, p->pathlen);
>         struct unpack_trees_options *o = info->data;
>      -+ struct strbuf full_path = STRBUF_INIT;
>
>      -  if (0 <= pos)
>      -          return o->src_index->cache[pos];
>      -@@ unpack-trees.c: static struct cache_entry *find_cache_entry(struct traverse_info *info,
>      -  if (pos < 0 || pos >= o->src_index->cache_nr)
>      -          return NULL;
>      -
>      -+ strbuf_addstr(&full_path, info->traverse_path);
>      -+ strbuf_add(&full_path, p->path, p->pathlen);
>      -+ strbuf_addch(&full_path, '/');
>      -+
>      -  /*
>      -   * Due to lexicographic sorting and sparse directory
>      -   * entries ending with a trailing slash, our path as a
>       @@ unpack-trees.c: static struct cache_entry *find_cache_entry(struct traverse_info *info,
>         while (pos >= 0) {
>                 ce = o->src_index->cache[pos];
>
>       -         if (strncmp(ce->name, p->path, p->pathlen))
>      --                 return NULL;
>      -+         if (strncmp(ce->name, full_path.buf, full_path.len)) {
>      -+                 ce = NULL;
>      -+                 break;
>      -+         }
>      ++         if (!skip_prefix(ce->name, info->traverse_path, &path) ||
>      ++             strncmp(path, p->path, p->pathlen) ||
>      ++             path[p->pathlen] != '/')
>      +                  return NULL;
>
>                 if (S_ISSPARSEDIR(ce->ce_mode) &&
>      -              sparse_dir_matches_path(ce, info, p))
>      --                 return ce;
>      -+                 break;
>      -
>      -          pos--;
>      -  }
>      -
>      -- return NULL;
>      -+ strbuf_release(&full_path);
>      -+ return ce;
>      - }
>      -
>      - static void debug_path(struct traverse_info *info)

René's suggestion definitely made this patch simpler (which sticks out
more when you look at the patch rather than the range-diff).

>   5:  5d28570c82a !  6:  1958751aa0e sparse-checkout: create helper methods
>      @@ Commit message
>           adopt these helpers. There are just two in builtin/add.c and
>           sparse-index.c that can use path_in_sparse_checkout().
>
>      +    We add a path_in_cone_mode_sparse_checkout() as well that will only
>      +    return false if the path is outside of the sparse-checkout definition
>      +    _and_ the sparse-checkout patterns are in cone mode.
>      +
>           Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>
>        ## builtin/add.c ##
>      @@ builtin/add.c: static int refresh(int verbose, const struct pathspec *pathspec)
>       -                     (sparse_checkout_enabled &&
>       -                      !path_matches_pattern_list(path, strlen(path), NULL,
>       -                                                 &dtype, &pl, &the_index))) {
>      -+                     (core_apply_sparse_checkout &&
>      -+                      path_in_sparse_checkout(path, &the_index) == NOT_MATCHED)) {
>      ++                     !path_in_sparse_checkout(path, &the_index)) {
>                                 string_list_append(&only_match_skip_worktree,
>                                                    pathspec->items[i].original);
>                         } else {
>      @@ dir.c: done:
>
>       +int init_sparse_checkout_patterns(struct index_state *istate)
>       +{
>      -+ if (!core_apply_sparse_checkout ||
>      -+     istate->sparse_checkout_patterns)
>      ++ if (!core_apply_sparse_checkout)
>      ++         return 1;
>      ++ if (istate->sparse_checkout_patterns)
>       +         return 0;
>       +
>       + CALLOC_ARRAY(istate->sparse_checkout_patterns, 1);
>      @@ dir.c: done:
>       + return 0;
>       +}
>       +
>      -+int path_in_sparse_checkout(const char *path,
>      -+                     struct index_state *istate)
>      ++static int path_in_sparse_checkout_1(const char *path,
>      ++                              struct index_state *istate,
>      ++                              int require_cone_mode)
>       +{
>       + const char *base;
>       + int dtype = DT_REG;
>      -+ init_sparse_checkout_patterns(istate);
>       +
>      -+ if (!istate->sparse_checkout_patterns)
>      -+         return MATCHED;
>      ++ /*
>      ++  * We default to accepting a path if there are no patterns or
>      ++  * they are of the wrong type.
>      ++  */
>      ++ if (init_sparse_checkout_patterns(istate) ||
>      ++     (require_cone_mode &&
>      ++      !istate->sparse_checkout_patterns->use_cone_patterns))
>      ++         return 1;
>       +
>       + base = strrchr(path, '/');
>       + return path_matches_pattern_list(path, strlen(path), base ? base + 1 : path,
>      @@ dir.c: done:
>       +                                  istate->sparse_checkout_patterns,
>       +                                  istate) > 0;
>       +}
>      ++
>      ++int path_in_sparse_checkout(const char *path,
>      ++                     struct index_state *istate)
>      ++{
>      ++ return path_in_sparse_checkout_1(path, istate, 0);
>      ++}
>      ++
>      ++int path_in_cone_mode_sparse_checkout(const char *path,
>      ++                              struct index_state *istate)
>      ++{
>      ++ return path_in_sparse_checkout_1(path, istate, 1);
>      ++}
>       +
>        static struct path_pattern *last_matching_pattern_from_lists(
>                 struct dir_struct *dir, struct index_state *istate,
>      @@ dir.h: enum pattern_match_result path_matches_pattern_list(const char *pathname,
>       +
>       +int path_in_sparse_checkout(const char *path,
>       +                     struct index_state *istate);
>      ++int path_in_cone_mode_sparse_checkout(const char *path,
>      ++                               struct index_state *istate);
>       +
>        struct dir_entry *dir_add_ignored(struct dir_struct *dir,
>                                   struct index_state *istate,
>      @@ dir.h: enum pattern_match_result path_matches_pattern_list(const char *pathname,
>
>        ## sparse-index.c ##
>       @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>      + {
>         int i, can_convert = 1;
>         int start_converted = num_converted;
>      -  enum pattern_match_result match;
>      +- enum pattern_match_result match;
>       - int dtype = DT_UNKNOWN;
>         struct strbuf child_path = STRBUF_INIT;
>       - struct pattern_list *pl = istate->sparse_checkout_patterns;
>      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>          */
>       - match = path_matches_pattern_list(ct_path, ct_pathlen,
>       -                                   NULL, &dtype, pl, istate);
>      -+ match = path_in_sparse_checkout(ct_path, istate);
>      -  if (match != NOT_MATCHED)
>      +- if (match != NOT_MATCHED)
>      ++ if (path_in_sparse_checkout(ct_path, istate))
>                 can_convert = 0;
>
>      +  for (i = start; can_convert && i < end; i++) {
>       @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>         if (!istate->repo->settings.sparse_index)
>                 return 0;
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       -         if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
>       -                 return 0;
>       - }
>      -+ if (init_sparse_checkout_patterns(istate) < 0)
>      ++ if (init_sparse_checkout_patterns(istate))
>       +         return 0;
>
>         /*
>   6:  c9e100e68f8 !  7:  e496f3cee66 attr: be careful about sparse directories
>      @@ attr.c: static struct attr_stack *read_attr_from_index(struct index_state *istat
>                 return NULL;
>
>       + /*
>      -+  * In the case of cone-mode sparse-checkout, getting the
>      -+  * .gitattributes file from a directory is meaningless: all
>      -+  * contained paths will be sparse if the .gitattributes is also
>      -+  * sparse. In the case of a sparse index, it is critical that we
>      -+  * don't go looking for one as it will expand the index.
>      ++  * The .gitattributes file only applies to files within its
>      ++  * parent directory. In the case of cone-mode sparse-checkout,
>      ++  * the .gitattributes file is sparse if and only if all paths
>      ++  * within that directory are also sparse. Thus, don't load the
>      ++  * .gitattributes file since it will not matter.
>      ++  *
>      ++  * In the case of a sparse index, it is critical that we don't go
>      ++  * looking for a .gitattributes file, as the index will expand.

This comment is better, and I appreciate the thought behind adding the
second paragraph, but the wording seems fuzzy and potentially
confusing to future readers.  Perhaps changing the wording to:

...looking for a .gitattributes file, as doing so would cause the
index to expand.

>       +  */
>      -+ init_sparse_checkout_patterns(istate);
>      -+ if (istate->sparse_checkout_patterns &&
>      -+     istate->sparse_checkout_patterns->use_cone_patterns &&
>      -+     path_in_sparse_checkout(path, istate) == NOT_MATCHED)
>      ++ if (!path_in_cone_mode_sparse_checkout(path, istate))
>       +         return NULL;
>       +
>         buf = read_blob_data_from_index(istate, path, NULL);
>   7:  b0ece4b7dcc !  8:  cab9360b1e9 sparse-index: add SPARSE_INDEX_IGNORE_CONFIG flag
>      @@ Metadata
>       Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>
>        ## Commit message ##
>      -    sparse-index: add SPARSE_INDEX_IGNORE_CONFIG flag
>      +    sparse-index: add SPARSE_INDEX_MEMORY_ONLY flag
>
>           The convert_to_sparse() method checks for the GIT_TEST_SPARSE_INDEX
>           environment variable or the "index.sparse" config setting before
>      @@ Commit message
>           index as an in-memory data structure, regardless of whether the on-disk
>           format should be sparse.
>
>      -    To that end, create the SPARSE_INDEX_IGNORE_CONFIG flag that will skip
>      +    To that end, create the SPARSE_INDEX_MEMORY_ONLY flag that will skip
>           these config checks when enabled. All current consumers are modified to
>           pass '0' in the new 'flags' parameter.
>
>      @@ sparse-index.c: static int index_has_unmerged_entries(struct index_state *istate
>       +int convert_to_sparse(struct index_state *istate, int flags)
>        {
>         int test_env;
>      +- if (istate->split_index || istate->sparse_index || !istate->cache_nr ||
>      ++ if (istate->sparse_index || !istate->cache_nr ||
>      +      !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>      +          return 0;
>
>      -@@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>         if (!istate->repo)
>                 istate->repo = the_repository;
>
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       - test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
>       - if (test_env >= 0)
>       -         set_sparse_index_config(istate->repo, test_env);
>      -+ if (!(flags & SPARSE_INDEX_IGNORE_CONFIG)) {
>      ++ if (!(flags & SPARSE_INDEX_MEMORY_ONLY)) {
>      ++         /*
>      ++          * The sparse index is not (yet) integrated with a split index.
>      ++          */
>      ++         if (istate->split_index)
>      ++                 return 0;
>       +         /*
>       +          * The GIT_TEST_SPARSE_INDEX environment variable triggers the
>       +          * index.sparse config variable to be on.
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       +                 return 0;
>       + }
>
>      -  if (init_sparse_checkout_patterns(istate) < 0)
>      +  if (init_sparse_checkout_patterns(istate))
>                 return 0;
>
>        ## sparse-index.h ##
>      @@ sparse-index.h
>
>        struct index_state;
>       -int convert_to_sparse(struct index_state *istate);
>      -+#define SPARSE_INDEX_IGNORE_CONFIG (1 << 0)
>      ++#define SPARSE_INDEX_MEMORY_ONLY (1 << 0)
>       +int convert_to_sparse(struct index_state *istate, int flags);
>
>        /*
>   8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
>      @@ Commit message
>           the sparse directory. This depends on the indexed version of the file,
>           so the sparse directory must be expanded.
>
>      +    We must take special care to look for untracked, non-ignored files in
>      +    these directories before deleting them. We do not want to delete any
>      +    meaningful work that the users were doing in those directories and
>      +    perhaps forgot to add and commit before switching sparse-checkout
>      +    definitions. Since those untracked files might be code files that
>      +    generated ignored build output, also do not delete any ignored files
>      +    from these directories in that case. The users can recover their state
>      +    by resetting their sparse-checkout definition to include that directory
>      +    and continue. Alternatively, they can see the warning that is presented
>      +    and delete the directory themselves to regain the performance they
>      +    expect.
>      +
>           By deleting the sparse directories when changing scope (or running 'git
>           sparse-checkout reapply') we regain these performance benefits as if the
>           repository was in a clean state.
>      @@ Documentation/git-sparse-checkout.txt: case-insensitive check. This corrects for
>        'git sparse-checkout set' command to reflect the expected cone in the working
>        directory.
>
>      -+The cone mode sparse-checkout patterns will also remove ignored files that
>      -+are not within the sparse-checkout definition. This is important behavior
>      -+to preserve the performance of the sparse index, but also matches that
>      -+cone mode patterns care about directories, not files. If there exist files
>      -+that are untracked and not ignored, then Git will not delete files within
>      -+that directory other than the tracked files that are now out of scope.
>      -+These files should be removed manually to ensure Git can behave optimally.
>      ++When changing the sparse-checkout patterns in cone mode, Git will inspect each
>      ++tracked directory that is not within the sparse-checkout cone to see if it
>      ++contains any untracked files. If all of those files are ignored due to the
>      ++`.gitignore` patterns, then the directory will be deleted. If any of the
>      ++untracked files within that directory is not ignored, then no deletions will
>      ++occur within that directory and a warning message will appear. If these files
>      ++are important, then reset your sparse-checkout definition so they are included,
>      ++use `git add` and `git commit` to store them, then remove any remaining files
>      ++manually to ensure Git can behave optimally.
>       +
>
>        SUBMODULES
>      @@ builtin/sparse-checkout.c: static int sparse_checkout_list(int argc, const char
>       +  */
>       + if (!r || !r->index || !r->worktree)
>       +         return;
>      -+ init_sparse_checkout_patterns(r->index);
>      -+ if (!r->index->sparse_checkout_patterns ||
>      ++ if (init_sparse_checkout_patterns(r->index) ||
>       +     !r->index->sparse_checkout_patterns->use_cone_patterns)
>       +         return;
>       +
>      @@ builtin/sparse-checkout.c: static int sparse_checkout_list(int argc, const char
>       +          * prevents us from converting to a sparse index, then do
>       +          * not try deleting files.
>       +          */
>      -+         if (convert_to_sparse(r->index, SPARSE_INDEX_IGNORE_CONFIG))
>      ++         if (convert_to_sparse(r->index, SPARSE_INDEX_MEMORY_ONLY))
>       +                 return;
>       +         was_full = 1;
>       + }

Thanks for the extra explanations; those help.

You haven't yet addressed how this choice will interact with
.gitignore files, though (as I commented on v3).  You alluded in patch
7 to how it was critical to avoid looking for .gitattributes files
because doing so would require the index to expand, possibly
recursively all the way down.  Don't .gitignore files have the same
problem?  Nuking all the ignored files so that you can delete extra
directories and thus not need to walk into them during operations
helps, but does the fact that we sometimes avoid removing directories
mean we have to walk into them anyway?  Do we only need to walk into
such directories when they are (still? again?) present in the working
copy?  Does there need to be some code added somewhere that checks for
the presence of a directory in the working copy and unsparsifies
sparse directory entries in the index when such directories are found?

Also, does non-cone mode sparsity paths have weird bugs around letting
users specify patterns that would happen to also exclude .gitignore
files from the working copy, but keep other sibling files within the
same directory?  In particular, if the .gitignore files aren't
present, does git add -A add a bunch of supposed-to-be-ignored files?
(If so, that's an issue that's probably tangential to this series
since it's not about cone mode or the sparse index, but it's certainly
an interesting thought.)

Maybe the whole .gitignore situation with cone mode has a simple
answer, but since this whole series is about gaining performance by
avoiding recursing into directories corresponding to sparse directory
entries, it feels the answers to how .gitignore will be handled should
at least be addressed.

>   -:  ----------- > 10:  8d55a6ba2fd sparse-checkout: add config to disable deleting dirs

I left some comments.


Thanks, as always, for all your hard work on this!




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux