Re: [PATCH v12 5/5] read-cache: speed up has_dir_name (part 2)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Apr 19, 2017 at 05:06:18PM +0000, git@xxxxxxxxxxxxxxxxx wrote:
> From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
> 
> Teach has_dir_name() to see if the path of the new item
> is greater than the last path in the index array before
> attempting to search for it.
> 
> has_dir_name() is looking for file/directory collisions
> in the index and has to consider each sub-directory
> prefix in turn.  This can cause multiple binary searches
> for each path.
> 
> During operations like checkout, merge_working_tree()
> populates the new index in sorted order, so we expect
> to be able to append in many cases.
> 
> This commit is part 2 of 2.  This commit handles the
> additional possible short-cuts as we look at each
> sub-directory prefix.
> 
> The net-net gains for add_index_entry_with_check() and
> both had_dir_name() commits are best seen for very
> large repos.
> 
> Here are results for an INFLATED version of linux.git
> with 1M files.
> 
> $ GIT_PERF_REPO=/mnt/test/linux_inflated.git/ ./run upstream/base HEAD ./p0006-read-tree-checkout.sh
> Test                                                            upstream/base      HEAD
> 0006.2: read-tree br_base br_ballast (1043893)                  3.79(3.63+0.15)    2.68(2.52+0.15) -29.3%
> 0006.3: switch between br_base br_ballast (1043893)             7.55(6.58+0.44)    6.03(4.60+0.43) -20.1%
> 0006.4: switch between br_ballast br_ballast_plus_1 (1043893)   10.84(9.26+0.59)   8.44(7.06+0.65) -22.1%
> 0006.5: switch between aliases (1043893)                        10.93(9.39+0.58)   10.24(7.04+0.63) -6.3%
> 
> Here are results for a synthetic repo with 4.2M files.
> 
> $ GIT_PERF_REPO=~/work/gfw/t/perf/repos/gen-many-files-10.4.3.git/ ./run HEAD~3 HEAD ./p0006-read-tree-checkout.sh
> Test                                                            HEAD~3               HEAD
> 0006.2: read-tree br_base br_ballast (4194305)                  29.96(19.26+10.50)   23.76(13.42+10.12) -20.7%
> 0006.3: switch between br_base br_ballast (4194305)             56.95(36.08+16.83)   45.54(25.94+15.68) -20.0%
> 0006.4: switch between br_ballast br_ballast_plus_1 (4194305)   90.94(51.50+31.52)   78.22(39.39+30.70) -14.0%
> 0006.5: switch between aliases (4194305)                        93.72(51.63+34.09)   77.94(39.00+30.88) -16.8%
> 
> Results for medium repos (like linux.git) are mixed and have
> more variance (probably do to disk IO unrelated to this test.
> 
> $ GIT_PERF_REPO=/mnt/test/linux.git/ ./run HEAD~3 HEAD ./p0006-read-tree-checkout.sh
> Test                                                          HEAD~3             HEAD
> 0006.2: read-tree br_base br_ballast (57994)                  0.25(0.21+0.03)    0.20(0.17+0.02) -20.0%
> 0006.3: switch between br_base br_ballast (57994)             10.67(6.06+2.92)   10.51(5.94+2.91) -1.5%
> 0006.4: switch between br_ballast br_ballast_plus_1 (57994)   0.59(0.47+0.16)    0.52(0.40+0.13) -11.9%
> 0006.5: switch between aliases (57994)                        0.59(0.44+0.17)    0.51(0.38+0.14) -13.6%
> 
> $ GIT_PERF_REPO=/mnt/test/linux.git/ ./run HEAD~3 HEAD ./p0006-read-tree-checkout.sh
> Test                                                          HEAD~3             HEAD
> 0006.2: read-tree br_base br_ballast (57994)                  0.24(0.21+0.02)    0.21(0.18+0.02) -12.5%
> 0006.3: switch between br_base br_ballast (57994)             10.42(5.98+2.91)   10.66(5.86+3.09) +2.3%
> 0006.4: switch between br_ballast br_ballast_plus_1 (57994)   0.59(0.49+0.13)    0.53(0.37+0.16) -10.2%
> 0006.5: switch between aliases (57994)                        0.59(0.43+0.17)    0.50(0.37+0.14) -15.3%
> 
> Results for smaller repos (like git.git) are not significant.
> $ ./run HEAD~3 HEAD ./p0006-read-tree-checkout.sh
> Test                                                         HEAD~3            HEAD
> 0006.2: read-tree br_base br_ballast (3043)                  0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
> 0006.3: switch between br_base br_ballast (3043)             0.31(0.17+0.11)   0.29(0.19+0.08) -6.5%
> 0006.4: switch between br_ballast br_ballast_plus_1 (3043)   0.03(0.02+0.00)   0.03(0.02+0.00) +0.0%
> 0006.5: switch between aliases (3043)                        0.03(0.02+0.00)   0.03(0.02+0.00) +0.0%
> 
> Signed-off-by: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
> ---
>  read-cache.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 62 insertions(+), 1 deletion(-)
> 
> diff --git a/read-cache.c b/read-cache.c
> index 9af0bd4..c252b6c 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -965,7 +965,7 @@ static int has_dir_name(struct index_state *istate,
>  	}
>  
>  	for (;;) {
> -		int len;
> +		size_t len;
>  
>  		for (;;) {
>  			if (*--slash == '/')
> @@ -975,6 +975,67 @@ static int has_dir_name(struct index_state *istate,
>  		}
>  		len = slash - name;
>  
> +		if (cmp_last > 0) {
> +			/*
> +			 * (len + 1) is a directory boundary (including
> +			 * the trailing slash).  And since the loop is
> +			 * decrementing "slash", the first iteration is
> +			 * the longest directory prefix; subsequent
> +			 * iterations consider parent directories.
> +			 */
> +
> +			if (len + 1 <= len_eq_last) {
> +				/*
> +				 * The directory prefix (including the trailing
> +				 * slash) also appears as a prefix in the last
> +				 * entry, so the remainder cannot collide (because
> +				 * strcmp said the whole path was greater).
> +				 *
> +				 * EQ: last: xxx/A
> +				 *     this: xxx/B
> +				 *
> +				 * LT: last: xxx/file_A
> +				 *     this: xxx/file_B
> +				 */
> +				return retval;
> +			}
> +
> +			if (len > len_eq_last) {
> +				/*
> +				 * This part of the directory prefix (excluding
> +				 * the trailing slash) is longer than the known
> +				 * equal portions, so this sub-directory cannot
> +				 * collide with a file.
> +				 *
> +				 * GT: last: xxxA
> +				 *     this: xxxB/file
> +				 */
> +				return retval;
> +			}
> +
> +			if (istate->cache_nr > 0 &&
> +				ce_namelen(istate->cache[istate->cache_nr - 1]) > len) {
> +				/*
> +				 * The directory prefix lines up with part of
> +				 * a longer file or directory name, but sorts
> +				 * after it, so this sub-directory cannot
> +				 * collide with a file.
> +				 *
> +				 * last: xxx/yy-file (because '-' sorts before '/')
> +				 * this: xxx/yy/abc

This is problematic, because the index can already contain 'xxx/yy' as
a file, when adding 'xxx/yy/abc', but since 'xxx/yy' as a file sorts
before 'xxx/yy-file', the short-circuiting here doesn't see it and
thus leaves the d-f collision undetected.  Consequently, even Git
porcelain commands can create tree objects with duplicate entries, as
demonstrated in the tests below.

Before this patch the collision is noticed, the colliding file entry
is removed, and these tests succeed.  Removing this condition make
them succeed as well, and FWIW the test suite still passes, but I
haven't thought it through and haven't checked the impact on
performance.

  ---  >8  ---

diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 0000000000..4c802d5940
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,47 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'file to dir breakage' '
+	>file-to-dir &&
+	# This sorts between "file-to-dir" as a file and "file-to-dir"
+	# as a directory (with the trailing / appended implicitly.
+	>file-to-dir.uh-oh &&
+	git add file-to-dir file-to-dir.uh-oh &&
+	git commit -m "add a file" &&
+
+	# Not "git rm", to preserve "file-to-dir" in the index.
+	rm file-to-dir &&
+	mkdir file-to-dir &&
+	>file-to-dir/file &&
+
+	# It is important to add the file in the directory; adding only
+	# the directory doesnt trigger the bug.
+	git add file-to-dir/file &&
+
+	git diff --cached --no-renames --name-status >actual &&
+
+	cat >expect <<-\EOF &&
+	D	file-to-dir
+	A	file-to-dir/file
+	EOF
+	test_cmp expect actual
+'
+
+test_expect_success 'is it committed as-is?' '
+	git commit -m "replace file with a dir" &&
+	git ls-tree HEAD >actual &&
+
+	# Hardcoded SHA-1 oid :(, because with this bug present
+	# "git rev-parse HEAD:file-to-dir" doesnt show the oid of
+	# the tree, but the oid of the blob that shouldnt be there.
+	cat >expect <<-EOF &&
+	100644 blob $EMPTY_BLOB	file-to-dir.uh-oh
+	040000 tree df2b8fc99e1c1d4dbc0a854d9f72157f1d6ea078	file-to-dir
+	EOF
+	test_cmp expect actual
+'
+
+test_done

  ---  8<  ---

The first test fails with:

  + test_cmp expect actual
  --- expect      2020-07-04 16:52:07.173296314 +0000
  +++ actual      2020-07-04 16:52:07.173296314 +0000
  @@ -1,2 +1 @@
  -D      file-to-dir
   A      file-to-dir/file
  error: last command exited with $?=1
  not ok 1 - file to dir breakage
  
And the second:

  + test_cmp expect actual
  --- expect      2020-07-04 16:52:07.185296659 +0000
  +++ actual      2020-07-04 16:52:07.181296544 +0000
  @@ -1,2 +1,3 @@
  +100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391   file-to-dir
   100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391   file-to-dir.uh-oh
   040000 tree df2b8fc99e1c1d4dbc0a854d9f72157f1d6ea078   file-to-dir
  error: last command exited with $?=1
  not ok 2 - is it committed as-is?


> +				 */
> +				return retval;
> +			}
> +
> +			/*
> +			 * This is a possible collision. Fall through and
> +			 * let the regular search code handle it.
> +			 *
> +			 * last: xxx
> +			 * this: xxx/file
> +			 */
> +		}
> +
>  		pos = index_name_stage_pos(istate, name, len, stage);
>  		if (pos >= 0) {
>  			/*
> -- 
> 2.9.3
> 



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux