On 6/26/2020 2:34 AM, SZEDER Gábor wrote: > On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote: > >>>> + while (p > path) { >>>> + if (is_dir_sep(*p)) >>>> + fill_bloom_key(path, p - path, >>>> + &revs->bloom_keys[path_component_nr++], >>>> + revs->bloom_filter_settings); >>>> + p--; >>>> + } >>> >>> This walks the directory hierarchy upwards and adds bloom filters for >>> shorter and shorter paths, ("deepest first"). Good. >>> >>> And it supports all directory separators. On Windows that would be >>> slash (/) and backslash (\). I assume paths are normalized to use >>> only slashes when bloom filters are written, correct? Then the lookup >>> side needs to normalize a given path to only use slashes as well, >>> otherwise paths with backslashes cannot be found. This part seems to >>> be missing. >> >> Yes, that's a good point. We _require_ the paths to be normalized >> here to be Unix-style paths or else the Bloom filter keys are >> incorrect. Thankfully, they are. > > Unfortunately, they aren't always... > > Path normalization is done in normalize_path_copy_len(), whose > description says, among other things: > > * Performs the following normalizations on src, storing the result in dst: > * - Ensures that components are separated by '/' (Windows only) > > and the code indeed does: > > if (is_dir_sep(c)) { > *dst++ = '/'; > > Now, while parsing pathspecs this function is called via: > > parse_pathspec() > init_pathspec_item() > prefix_path_gently() > normalize_path_copy_len() > > Unfortunately, init_pathspec_item() has this chain of conditions: > > /* Create match string which will be used for pathspec matching */ > if (pathspec_prefix >= 0) { > match = xstrdup(copyfrom); > prefixlen = pathspec_prefix; > } else if (magic & PATHSPEC_FROMTOP) { > match = xstrdup(copyfrom); > prefixlen = 0; > } else { > match = prefix_path_gently(prefix, prefixlen, > &prefixlen, copyfrom); > if (!match) { > const char *hint_path = get_git_work_tree(); > if (!hint_path) > hint_path = get_git_dir(); > die(_("%s: '%s' is outside repository at '%s'"), elt, > copyfrom, absolute_path(hint_path)); > } > } > > which means that it doesn't always calls prefix_path_gently(), which, > in turn, means that 'pathspec_item->match' might remain un-normalized > in case of some unusual pathspecs. > > The first condition is supposed to handle the case when one Git > process passes pathspecs to another, and is supposed to be "internal > use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via > PATHSPEC_PREFIX_ORIGIN, 2013-07-14), I haven't even tried to grok what > that might entail. > > The second condition handles pathspecs explicitly relative to the root > of the work tree, i.e. ':/path'. Adding a printf() to show the > original path and the resulting 'pathspec_item->match' does confirm > that no normalization is performed: > > expecting success of 9999.1 'test': > mkdir -p dir && > >dir/file && > git add ":/dir/file" && > git add ":(top)dir/file" && > test_might_fail git add ":/dir//file" && > git add ":(top)dir//file" > > orig: ':/dir/file' > match: 'dir/file' > orig: ':(top)dir/file' > match: 'dir/file' > orig: ':/dir//file' > match: 'dir//file' > fatal: oops in prep_exclude > orig: ':(top)dir//file' > match: 'dir//file' > fatal: oops in prep_exclude > not ok 1 - test > > This is, of course, bad for Bloom filters, because the repeated > slashes are hashed as well and commits will be omitted from the output > of pathspec-limited revision walks, but apparently it also affects > other parts of Git. > > And the else branch handles the rest, which, I believe, is by far the > most common case. Thanks for this analysis. Clearly, there is already a bug here when the input data is not pristine. I didn't see this message when I submitted my v3, but normalizing the path data before computing filters can (hopefully) be done as a small patch before or after my v3 PATCH 10 without much conflict. Thanks, -Stolee