I didn't apply the patch and test it out but from just a code review, the logic behind and the design of this patch makes sense and it's a relatively small patch for the gain. It's also a great example of the minimal amount of work required to add a new extension into the index. Thank you for that. For more extreme cases where there are large numbers of entries in the sparse-checkout file, another approach may be more helpful. Our virtualization solution uses sparse checkout extensively. The sparse-checkout file starts out empty and as files are hydrated locally, they are added to the sparse-checkout file. This results in thousands of entries in the sparse-checkout file. To make that fast enough, we ended up creating a hashmap in mark_new_skip_worktree and then in clear_ce_flags_1 we use that hashmap to implement the pattern matching logic. The hashmap lookup is dramatically faster than the current recursive and complex pattern matching via the excludes mechanism so easily supports very large numbers of entries. Note, these are extreme test cases that caused the unmodified git.exe to crash so I can't really give before/after comparisons: With 35,000 entries in the sparse-checkout file, "git checkout -b xxx" took 42.5 seconds With 100,000 entries in the sparse-checkout file, "git checkout -b xxx" took 42.5 seconds With 3,279,254 entries in the sparse-checkout file, "git checkout -b xxx" took 1 min 38 seconds Note the first two numbers are the same because the hashmap based pattern matching is so fast, 95% of the time is now spent in merge_working_tree. Since the tip commit doesn’t change between the new and old branch and because a merge is done, the files in the working directory don’t change either so I believe there must be an optimization there but I haven't been able to figure one out that doesn't break a lot of other things. Ben p.s. Yes, I actually ran "git ls-files > .git/info/sparse-checkout" to test the extreme case. :) -----Original Message----- From: git-owner@xxxxxxxxxxxxxxx [mailto:git-owner@xxxxxxxxxxxxxxx] On Behalf Of Duy Nguyen Sent: Saturday, August 13, 2016 4:37 AM To: Git Mailing List <git@xxxxxxxxxxxxxxx>; Junio C Hamano <gitster@xxxxxxxxx> Cc: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> Subject: Re: [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged Ping.. On Tue, Jul 12, 2016 at 1:15 AM, Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> wrote: > When a "tree unpacking" operation is needed, which is part of > switching branches using "git checkout", the following happens in a > sparse checkout: > > 1) Run all existing entries through $GIT_DIR/info/sparse-checkout, > mark entries that are to-be-excluded or to-be-included. > > 2) Do n-way merge stuff to add, modify and delete entries. > > 3) Run all new entries added at step 2 through > $GIT_DIR/info/sparse-checkout, mark entries that are to-be-excluded > or to-be-included. > > 4) Compare the current excluded/include status with the to-be-status > from steps 1 and 3, delete newly excluded entries from worktree and > add newly included ones to worktree. > > The "all existing entries" number in step 1 does not scale well. As > worktree gets bigger (or more sparse patterns are added), step 1 runs > slower. Which does not help because large worktrees are the reason > sparse-checkout is used, to keep the real worktree small again. > > If we know that $GIT_DIR/info/sparse-checkout has not changed, we know > that running checking again would result in the exact same > included/excluded as recorded in the current index because > "sparse-checkout" is the only input to the exclude machinery. In this > case, marking the to-be-status is simply copying the current status > over, which is a lot faster. > > The time breakdown of "git checkout" (no arguments) on webkit.git > (100k files) with a sparse checkout file of 4 negative patterns is > like this, where "sparse checkout loop #1" takes about 10% execution > time, which is the time saved after this patch. > > read-cache.c:1661 performance: 0.057816104 s: read cache .git/index > files-backend.c:1097 performance: 0.000023980 s: read packed refs > preload-index.c:104 performance: 0.039178367 s: preload index > read-cache.c:1260 performance: 0.002700730 s: refresh index > name-hash.c:128 performance: 0.030409968 s: initialize name hash > > unpack-trees.c:1173 performance: 0.100353572 s: sparse checkout loop #1 > > cache-tree.c:431 performance: 0.137213472 s: cache_tree_update > unpack-trees.c:1305 performance: 0.648923590 s: unpack_trees > read-cache.c:2139 performance: 0.074800165 s: write index, changed mask = 28 > unpack-trees.c:1305 performance: 0.137108835 s: unpack_trees > diff-lib.c:506 performance: 0.137152238 s: diff-index > trace.c:420 performance: 0.972682413 s: git command: 'git' 'checkout' > > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> > --- > I mentioned about this some time ago and finally got curious enough > to try out. The saving is in the signficant range in my opinion, but > real world effect is probably lower (or much higher if you have so > many patterns in $GIT_DIR/info/sparse-checkout) > > Note that both cache_tree_update and sparse checkout loop #1 are part > of unpack_trees() so actual time spent on this function is more like > 0.4s. It's still a lot, but then this function is very scary to > optimize. > > Documentation/technical/index-format.txt | 6 +++++ > cache.h | 2 ++ > read-cache.c | 22 ++++++++++++++++- > unpack-trees.c | 42 ++++++++++++++++++++++++++++++-- > 4 files changed, 69 insertions(+), 3 deletions(-) > > diff --git a/Documentation/technical/index-format.txt > b/Documentation/technical/index-format.txt > index ade0b0c..3b0770a 100644 > --- a/Documentation/technical/index-format.txt > +++ b/Documentation/technical/index-format.txt > @@ -295,3 +295,9 @@ The remaining data of each directory block is grouped by type: > in the previous ewah bitmap. > > - One NUL. > + > +== Sparse checkout cache > + > + Sparse checkout extension saves the 20 bytes SHA-1 hash of > + $GIT_DIR/info/sparse-checkout at the time it is applied to the > + index. > diff --git a/cache.h b/cache.h > index f1dc289..cc4c2b1 100644 > --- a/cache.h > +++ b/cache.h > @@ -320,6 +320,7 @@ static inline unsigned int canon_mode(unsigned int mode) > #define CACHE_TREE_CHANGED (1 << 5) > #define SPLIT_INDEX_ORDERED (1 << 6) > #define UNTRACKED_CHANGED (1 << 7) > +#define SPARSE_CHECKOUT_CHANGED (1 << 8) > > struct split_index; > struct untracked_cache; > @@ -338,6 +339,7 @@ struct index_state { > struct hashmap dir_hash; > unsigned char sha1[20]; > struct untracked_cache *untracked; > + unsigned char sparse_checkout[20]; > }; > > extern struct index_state the_index; > diff --git a/read-cache.c b/read-cache.c index db27766..371d2c7 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -40,11 +40,13 @@ static struct cache_entry > *refresh_cache_entry(struct cache_entry *ce, #define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */ > #define CACHE_EXT_LINK 0x6c696e6b /* "link" */ > #define CACHE_EXT_UNTRACKED 0x554E5452 /* "UNTR" */ > +#define CACHE_EXT_SPARSE 0x5350434F /* "SPCO" */ > > /* changes that can be kept in $GIT_DIR/index (basically all > extensions) */ #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \ > CE_ENTRY_ADDED | CE_ENTRY_REMOVED | CE_ENTRY_CHANGED | \ > - SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED) > + SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED | \ > + SPARSE_CHECKOUT_CHANGED) > > struct index_state the_index; > static const char *alternate_index_output; @@ -1384,6 +1386,11 @@ > static int read_index_extension(struct index_state *istate, > case CACHE_EXT_UNTRACKED: > istate->untracked = read_untracked_extension(data, sz); > break; > + case CACHE_EXT_SPARSE: > + if (sz != sizeof(istate->sparse_checkout)) > + return error("bad %.4s extension", ext); > + hashcpy(istate->sparse_checkout, data); > + break; > default: > if (*ext < 'A' || 'Z' < *ext) > return error("index uses %.4s extension, which > we do not understand", @@ -1704,6 +1711,7 @@ int discard_index(struct index_state *istate) > discard_split_index(istate); > free_untracked_cache(istate->untracked); > istate->untracked = NULL; > + hashclr(&istate->sparse_checkout); > return 0; > } > > @@ -2101,6 +2109,18 @@ static int do_write_index(struct index_state *istate, int newfd, > if (err) > return -1; > } > + if (!strip_extensions && !is_null_sha1(istate->sparse_checkout)) { > + struct strbuf sb = STRBUF_INIT; > + > + strbuf_add(&sb, istate->sparse_checkout, > + sizeof(istate->sparse_checkout)); > + err = write_index_ext_header(&c, newfd, CACHE_EXT_SPARSE, > + sb.len) < 0 || > + ce_write(&c, newfd, sb.buf, sb.len) < 0; > + strbuf_release(&sb); > + if (err) > + return -1; > + } > > if (ce_flush(&c, newfd, istate->sha1) || fstat(newfd, &st)) > return -1; > diff --git a/unpack-trees.c b/unpack-trees.c index 6bc9512..f3916a9 > 100644 > --- a/unpack-trees.c > +++ b/unpack-trees.c > @@ -1080,6 +1080,25 @@ static void mark_new_skip_worktree(struct exclude_list *el, > select_flag, skip_wt_flag, el); } > > +static void get_sparse_checkout_hash(unsigned char *sha1) { > + struct stat st; > + int fd; > + > + hashclr(sha1); > + fd = open(git_path("info/sparse-checkout"), O_RDONLY); > + if (fd == -1) > + return; > + if (fstat(fd, &st)) { > + close(fd); > + return; > + } > + if (index_fd(sha1, fd, &st, OBJ_BLOB, > + git_path("info/sparse-checkout"), 0) < 0) > + hashclr(sha1); > + close(fd); > +} > + > static int verify_absent(const struct cache_entry *, > enum unpack_trees_error_types, > struct unpack_trees_options *); @@ -1094,6 > +1113,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options > int i, ret; > static struct cache_entry *dfc; > struct exclude_list el; > + unsigned char sparse_checkout_hash[20]; > > if (len > MAX_UNPACK_TREES) > die("unpack_trees takes at most %d trees", > MAX_UNPACK_TREES); @@ -1131,8 +1151,21 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options > /* > * Sparse checkout loop #1: set NEW_SKIP_WORKTREE on existing entries > */ > - if (!o->skip_sparse_checkout) > - mark_new_skip_worktree(o->el, o->src_index, 0, CE_NEW_SKIP_WORKTREE); > + if (!o->skip_sparse_checkout) { > + get_sparse_checkout_hash(sparse_checkout_hash); > + > + if (!is_null_sha1(sparse_checkout_hash) && > + !hashcmp(o->src_index->sparse_checkout, sparse_checkout_hash)) { > + struct index_state *istate = o->src_index; > + for (i = 0; i < istate->cache_nr; i++) { > + struct cache_entry *ce = istate->cache[i]; > + if (ce_skip_worktree(ce)) > + ce->ce_flags |= CE_NEW_SKIP_WORKTREE; > + } > + } else > + mark_new_skip_worktree(o->el, o->src_index, > + 0, CE_NEW_SKIP_WORKTREE); > + } > > if (!dfc) > dfc = xcalloc(1, cache_entry_size(0)); @@ -1236,6 > +1269,11 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options > ret = unpack_failed(o, "Sparse checkout leaves no entry on working directory"); > goto done; > } > + > + if (o->dst_index && !is_null_sha1(sparse_checkout_hash)) { > + hashcpy(o->result.sparse_checkout, sparse_checkout_hash); > + o->result.cache_changed |= SPARSE_CHECKOUT_CHANGED; > + } > } > > o->src_index = NULL; > -- > 2.8.2.537.g0965dd9 > -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html