Re: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote:

> >> +	while (p > path) {
> >> +		if (is_dir_sep(*p))
> >> +			fill_bloom_key(path, p - path,
> >> +				       &revs->bloom_keys[path_component_nr++],
> >> +				       revs->bloom_filter_settings);
> >> +		p--;
> >> +	}
> > 
> > This walks the directory hierarchy upwards and adds bloom filters for
> > shorter and shorter paths, ("deepest first").  Good.
> > 
> > And it supports all directory separators.  On Windows that would be
> > slash (/) and backslash (\).  I assume paths are normalized to use
> > only slashes when bloom filters are written, correct?  Then the lookup
> > side needs to normalize a given path to only use slashes as well,
> > otherwise paths with backslashes cannot be found.  This part seems to
> > be missing.
> 
> Yes, that's a good point. We _require_ the paths to be normalized
> here to be Unix-style paths or else the Bloom filter keys are
> incorrect. Thankfully, they are.

Unfortunately, they aren't always...

Path normalization is done in normalize_path_copy_len(), whose
description says, among other things:

   * Performs the following normalizations on src, storing the result in dst:
   * - Ensures that components are separated by '/' (Windows only)

and the code indeed does:

        if (is_dir_sep(c)) {
                *dst++ = '/';

Now, while parsing pathspecs this function is called via:

  parse_pathspec()
    init_pathspec_item()
      prefix_path_gently()
        normalize_path_copy_len()

Unfortunately, init_pathspec_item() has this chain of conditions:

        /* Create match string which will be used for pathspec matching */
        if (pathspec_prefix >= 0) {
                match = xstrdup(copyfrom);
                prefixlen = pathspec_prefix;
        } else if (magic & PATHSPEC_FROMTOP) {
                match = xstrdup(copyfrom);
                prefixlen = 0;
        } else {
                match = prefix_path_gently(prefix, prefixlen,
                                           &prefixlen, copyfrom);
                if (!match) {
                        const char *hint_path = get_git_work_tree();
                        if (!hint_path)
                                hint_path = get_git_dir();
                        die(_("%s: '%s' is outside repository at '%s'"), elt,
                            copyfrom, absolute_path(hint_path));
                }
        }

which means that it doesn't always calls prefix_path_gently(), which,
in turn, means that 'pathspec_item->match' might remain un-normalized
in case of some unusual pathspecs.

The first condition is supposed to handle the case when one Git
process passes pathspecs to another, and is supposed to be "internal
use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via
PATHSPEC_PREFIX_ORIGIN, 2013-07-14), I haven't even tried to grok what
that might entail.

The second condition handles pathspecs explicitly relative to the root
of the work tree, i.e. ':/path'.  Adding a printf() to show the
original path and the resulting 'pathspec_item->match' does confirm
that no normalization is performed:

  expecting success of 9999.1 'test': 
          mkdir -p dir &&
          >dir/file &&
          git add ":/dir/file" &&
          git add ":(top)dir/file" &&
          test_might_fail git add ":/dir//file" &&
          git add ":(top)dir//file"
  
  orig:  ':/dir/file'
  match: 'dir/file'
  orig:  ':(top)dir/file'
  match: 'dir/file'
  orig:  ':/dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  orig:  ':(top)dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  not ok 1 - test

This is, of course, bad for Bloom filters, because the repeated
slashes are hashed as well and commits will be omitted from the output
of pathspec-limited revision walks, but apparently it also affects
other parts of Git.

And the else branch handles the rest, which, I believe, is by far the
most common case.

> Let's make that clear in-code by
> using '/' instead of is_dir_sep().
> 
> Thanks,
> -Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux