Am 09.05.20 um 19:28 schrieb Junio C Hamano: > René Scharfe <l.s.r@xxxxxx> writes: > >> So I got curious if such trees might be in popular repos, wrote the patch >> below and checked around a bit, but couldn't find any. >> >> Is there a smarter way to check for duplicates? One that doesn't need >> allocations? Perhaps by having a version of tree_entry_extract() that >> seeks backwards somehow? > > I've never looked into seeking backwards in a tree object, but in > unpack-trees, I had to deal with this exact problem that a blob > 'hello' sorts before 'hello.c' which in turn sorts after a tree > 'hello' because of the "implicit slash after the name for an entry > for a tree object" rule by introducing the "cache_bottom" hack in > the traversal logic to limit how far we must scan back. > > We may be able to limit the list of "seen recently" names in a > similar way. > > If the tree we are scanning has 'hello' (blob), 'hello.c' and > 'hellp', the bottom pointer initially would be at 'hello' (blob), > then stay there when we see 'hello.c' (because the next entry might > be 'hello' (tree) that would crash with 'hello'), and when we see > the entry 'hellp', we know that the entry at the bottom pointer > 'hello' (blob) cannot crash with any entry that comes later in the > tree object we are scanning, so we can advance the bottom pointer > forward. To decide if we can advance the bottom pointer beyond > 'hello.c' (blob), we see if 'hello.c' (tree) can appear after the > current entry we are looking at (i.e. 'hellp'), and we know it > cannot without violating the sort order. So the bottom would move > to point at 'hellp' we just saw. > > If we had 'hello' (tree) instead of 'hellp', when we look at it > after looking at 'hello' (blob) and 'hello.c', we scan from the > bottom pointer up to the previous entry, which is still pointing at > 'hello' (blob), and notice the crash. > Hmm, this could lead to quadratic behavior in the worst case, can't it? Imagine you have: a a.b a.b.c ... a.b.c/ a.b/ a/ If you have a single bottom pointer, it needs to stay at "a" the whole time, so that you can detect the d/f conflict with "a/" at the end. And for all entries in between you need to scan from "a" on and compare each entry -- quadratic. The blobs "a.b" and "a.b.c" don't need to actually be present, but we need to scan all entries to determine "a.b/" and "a.b.c/" are not conflicting anyway. Right? We could, however, reduce the names we add to the string_list to those that are possible candidates for conflict -- blobs followed by an entry whose name starts with the blob name followed by a dot and trees that follow an entry whose name matches in the same way. René