On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote: > >> + while (p > path) { > >> + if (is_dir_sep(*p)) > >> + fill_bloom_key(path, p - path, > >> + &revs->bloom_keys[path_component_nr++], > >> + revs->bloom_filter_settings); > >> + p--; > >> + } > > > > This walks the directory hierarchy upwards and adds bloom filters for > > shorter and shorter paths, ("deepest first"). Good. > > > > And it supports all directory separators. On Windows that would be > > slash (/) and backslash (\). I assume paths are normalized to use > > only slashes when bloom filters are written, correct? Then the lookup > > side needs to normalize a given path to only use slashes as well, > > otherwise paths with backslashes cannot be found. This part seems to > > be missing. > > Yes, that's a good point. We _require_ the paths to be normalized > here to be Unix-style paths or else the Bloom filter keys are > incorrect. Thankfully, they are. Unfortunately, they aren't always... Path normalization is done in normalize_path_copy_len(), whose description says, among other things: * Performs the following normalizations on src, storing the result in dst: * - Ensures that components are separated by '/' (Windows only) and the code indeed does: if (is_dir_sep(c)) { *dst++ = '/'; Now, while parsing pathspecs this function is called via: parse_pathspec() init_pathspec_item() prefix_path_gently() normalize_path_copy_len() Unfortunately, init_pathspec_item() has this chain of conditions: /* Create match string which will be used for pathspec matching */ if (pathspec_prefix >= 0) { match = xstrdup(copyfrom); prefixlen = pathspec_prefix; } else if (magic & PATHSPEC_FROMTOP) { match = xstrdup(copyfrom); prefixlen = 0; } else { match = prefix_path_gently(prefix, prefixlen, &prefixlen, copyfrom); if (!match) { const char *hint_path = get_git_work_tree(); if (!hint_path) hint_path = get_git_dir(); die(_("%s: '%s' is outside repository at '%s'"), elt, copyfrom, absolute_path(hint_path)); } } which means that it doesn't always calls prefix_path_gently(), which, in turn, means that 'pathspec_item->match' might remain un-normalized in case of some unusual pathspecs. The first condition is supposed to handle the case when one Git process passes pathspecs to another, and is supposed to be "internal use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via PATHSPEC_PREFIX_ORIGIN, 2013-07-14), I haven't even tried to grok what that might entail. The second condition handles pathspecs explicitly relative to the root of the work tree, i.e. ':/path'. Adding a printf() to show the original path and the resulting 'pathspec_item->match' does confirm that no normalization is performed: expecting success of 9999.1 'test': mkdir -p dir && >dir/file && git add ":/dir/file" && git add ":(top)dir/file" && test_might_fail git add ":/dir//file" && git add ":(top)dir//file" orig: ':/dir/file' match: 'dir/file' orig: ':(top)dir/file' match: 'dir/file' orig: ':/dir//file' match: 'dir//file' fatal: oops in prep_exclude orig: ':(top)dir//file' match: 'dir//file' fatal: oops in prep_exclude not ok 1 - test This is, of course, bad for Bloom filters, because the repeated slashes are hashed as well and commits will be omitted from the output of pathspec-limited revision walks, but apparently it also affects other parts of Git. And the else branch handles the rest, which, I believe, is by far the most common case. > Let's make that clear in-code by > using '/' instead of is_dir_sep(). > > Thanks, > -Stolee