Re: invalid tree and commit object

René Scharfe <l.s.r@xxxxxx> · Sun, 10 May 2020 11:07:25 +0200

Am 09.05.20 um 22:27 schrieb Junio C Hamano:
> René Scharfe <l.s.r@xxxxxx> writes:
>
>> Hmm, this could lead to quadratic behavior in the worst case, can't it?
>
> Oh, absolutely.  But as you have shown, you'd need a specially
> crafted tree with early entries that are prefixes of later ones,
> which would rather be rare, and most of the time the bottom pointer
> would advance by one every time we consume one path.
>
> So it is trading (hopefully rare) worst-case runtime with reduced
> storage cost.
>
>> We could, however, reduce the names we add to the string_list to
>> those that are possible candidates for conflict -- blobs followed by an
>> entry whose name starts with the blob name followed by a dot and trees
>> that follow an entry whose name matches in the same way.
>
> Yes, that is a valid solution that strikes a different balance
> between allocation and runtime.

fsck should ideally handle the worst cases gracefully -- I assume it has
to deal with the weirdest "natural" corruptions and the "best" that
pranksters and DoSers can come up with.

> We may want to survey how commonly "bad" trees appear in real
> projects.  Depending on the result, we might want to update the
> "limit re-scanning using the bottom pointer" hack we have been using
> in the unpack-trees code.

They are surprisingly common in Git's own repo.  Ca. 74% of the trees
in my clone had at least one candidate and the average number of
candidates per tree in those was ca. 8.9.  We have e.g. the t/ tree
with its several tnnnn/ vs. tnnnn-foo.sh entries, or builtin/ vs.
builtin.h.

In the Linux repo ca. 20% of the checked trees have at least a
candidate and the average number of candidates in those is ca. 2.4.

So the patch for only adding possible candidates to the string_list
is below (on top of my earlier one), but I wonder if we can do
better.

When we look for a d/f name clash we don't necessarily have to look
back upon finding a candidate directory -- we can also scan ahead
upon finding a candidate non-directory.  That could be done using
repeated update_tree_entry_gently() calls on a private copy.  It
wouldn't require any string_list allocations, but its worst case
complexity would still be quadratic, of course.

Would a stack work?  When we see a candidate non-directory, we put
it on the stack.  When we see a candidate directory, we compare it
to the entry at the top of the stack using strcmp().  Equality
indicates a duplicate and we are done.  If the directory name is
less then we can pop the entry from the stack and check again, as
we're past the point where a duplicate would be.  Makes sense?

The candidate stack solution would have linear complexity and
require less memory than a full list of candidates.

It would rely on the entries to be in the correct order (same as
the patch below, come to think of it), but that's probably OK.
We may miss DUPLICATE_ENTRIES (just like today :), but
TREE_NOT_SORTED would still be reported.

---
 fsck.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fsck.c b/fsck.c
index f47b35fee8..7d5bfef804 100644
--- a/fsck.c
+++ b/fsck.c
@@ -569,6 +569,25 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con
 	return c1 < c2 ? 0 : TREE_UNORDERED;
 }

+/*
+ * Consecutive entries are checked for duplicates in verify_ordered().
+ * However, there can be non-consecutive duplicates because a slash
+ * ('/') is appended implicitly to directory names during sorting, so
+ * there can be other names between a non-directory entry and a
+ * directory entry with the same name.  E.g.:
+ *
+ *    foo
+ *    foo.bar
+ *    foo/
+ *
+ */
+static int may_be_df_dup(const char *candidate, const char *interloper)
+{
+	const char *p;
+
+	return skip_prefix(interloper, candidate, &p) && '\0' < *p && *p < '/';
+}
+
 static int fsck_tree(const struct object_id *oid,
 		     const char *buffer, unsigned long size,
 		     struct fsck_options *options)
@@ -678,11 +697,14 @@ static int fsck_tree(const struct object_id *oid,
 			default:
 				break;
 			}
+			if (!S_ISDIR(o_mode) && may_be_df_dup(o_name, name))
+				string_list_append(&names, o_name);
+			if (S_ISDIR(mode) && may_be_df_dup(name, o_name))
+				string_list_append(&names, name);
 		}

 		o_mode = mode;
 		o_name = name;
-		string_list_append(&names, name);
 	}

 	nr = names.nr;
@@ -1221,7 +1243,6 @@ int fsck_finish(struct fsck_options *options)
 		free(buf);
 	}

-
 	oidset_clear(&gitmodules_found);
 	oidset_clear(&gitmodules_done);
 	return ret;
--
2.26.2