Am 09.05.20 um 22:27 schrieb Junio C Hamano: > René Scharfe <l.s.r@xxxxxx> writes: > >> Hmm, this could lead to quadratic behavior in the worst case, can't it? > > Oh, absolutely. But as you have shown, you'd need a specially > crafted tree with early entries that are prefixes of later ones, > which would rather be rare, and most of the time the bottom pointer > would advance by one every time we consume one path. > > So it is trading (hopefully rare) worst-case runtime with reduced > storage cost. > >> We could, however, reduce the names we add to the string_list to >> those that are possible candidates for conflict -- blobs followed by an >> entry whose name starts with the blob name followed by a dot and trees >> that follow an entry whose name matches in the same way. > > Yes, that is a valid solution that strikes a different balance > between allocation and runtime. fsck should ideally handle the worst cases gracefully -- I assume it has to deal with the weirdest "natural" corruptions and the "best" that pranksters and DoSers can come up with. > We may want to survey how commonly "bad" trees appear in real > projects. Depending on the result, we might want to update the > "limit re-scanning using the bottom pointer" hack we have been using > in the unpack-trees code. They are surprisingly common in Git's own repo. Ca. 74% of the trees in my clone had at least one candidate and the average number of candidates per tree in those was ca. 8.9. We have e.g. the t/ tree with its several tnnnn/ vs. tnnnn-foo.sh entries, or builtin/ vs. builtin.h. In the Linux repo ca. 20% of the checked trees have at least a candidate and the average number of candidates in those is ca. 2.4. So the patch for only adding possible candidates to the string_list is below (on top of my earlier one), but I wonder if we can do better. When we look for a d/f name clash we don't necessarily have to look back upon finding a candidate directory -- we can also scan ahead upon finding a candidate non-directory. That could be done using repeated update_tree_entry_gently() calls on a private copy. It wouldn't require any string_list allocations, but its worst case complexity would still be quadratic, of course. Would a stack work? When we see a candidate non-directory, we put it on the stack. When we see a candidate directory, we compare it to the entry at the top of the stack using strcmp(). Equality indicates a duplicate and we are done. If the directory name is less then we can pop the entry from the stack and check again, as we're past the point where a duplicate would be. Makes sense? The candidate stack solution would have linear complexity and require less memory than a full list of candidates. It would rely on the entries to be in the correct order (same as the patch below, come to think of it), but that's probably OK. We may miss DUPLICATE_ENTRIES (just like today :), but TREE_NOT_SORTED would still be reported. --- fsck.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/fsck.c b/fsck.c index f47b35fee8..7d5bfef804 100644 --- a/fsck.c +++ b/fsck.c @@ -569,6 +569,25 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con return c1 < c2 ? 0 : TREE_UNORDERED; } +/* + * Consecutive entries are checked for duplicates in verify_ordered(). + * However, there can be non-consecutive duplicates because a slash + * ('/') is appended implicitly to directory names during sorting, so + * there can be other names between a non-directory entry and a + * directory entry with the same name. E.g.: + * + * foo + * foo.bar + * foo/ + * + */ +static int may_be_df_dup(const char *candidate, const char *interloper) +{ + const char *p; + + return skip_prefix(interloper, candidate, &p) && '\0' < *p && *p < '/'; +} + static int fsck_tree(const struct object_id *oid, const char *buffer, unsigned long size, struct fsck_options *options) @@ -678,11 +697,14 @@ static int fsck_tree(const struct object_id *oid, default: break; } + if (!S_ISDIR(o_mode) && may_be_df_dup(o_name, name)) + string_list_append(&names, o_name); + if (S_ISDIR(mode) && may_be_df_dup(name, o_name)) + string_list_append(&names, name); } o_mode = mode; o_name = name; - string_list_append(&names, name); } nr = names.nr; @@ -1221,7 +1243,6 @@ int fsck_finish(struct fsck_options *options) free(buf); } - oidset_clear(&gitmodules_found); oidset_clear(&gitmodules_done); return ret; -- 2.26.2