Re: What's cooking in git.git (Oct 2012, #01; Tue, 2)

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Thu, 04 Oct 2012 11:34:47 +0200

On 10/03/2012 08:17 PM, Junio C Hamano wrote:
> Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes:
> 
>> There's an interesting case: "**foo". According to our rules, that
>> pattern does not contain slashes therefore is basename match. But some
>> might find that confusing because "**" can match slashes,...
> 
> By "our rules", if you mean "if a pattern has slash, it is anchored",
> that obviously need to be updated with this series, if "**" is meant
> to match multiple hierarchies.
>> I think the latter makes more sense. When users put "**" they expect
>> to match some slashes. But that may call for a refactoring in
>> path_matches() in attr.c. Putting strstr(pattern, "**") in that
>> matching function may increase overhead unnecessarily.
>>
>> The third option is just die() and let users decide either "*foo",
>> "**/foo" or "/**foo", never "**foo".
> 
> For the double-star at the beginning, you should just turn it into "**/"
> if it is not followed by a slash internally, I think.
> 
> What is the semantics of ** in the first place?  Is it described to
> a reasonable level of detail in the documentation updates?  For
> example does "**foo" match "afoo", "a/b/foo", "a/bfoo", "a/foo/b",
> "a/bfoo/c"?  Does "x**y" match "xy", "xay", "xa/by", "x/a/y"?
> 
> I am guessing that the only sensible definition is that "**"
> requires anything that comes before it (if exists) is at a proper
> hierarchy boundary, and anything matches it is also at a proper
> hierarchy boundary, so "x**y" matches "x/a/y" and not "xy", "xay",
> nor "xa/by" in the above example.  If "x**y" can match "xy" or "xay"
> (or "**foo" can match "afoo"), it would be unreasonable to say it
> implies the pattern is anchored at any level, no?

Given that there is no obvious interpretation for what a construct like
"x**y" would mean, and many plausible guesses (most of which sound
rather useless), I suggest that we forbid it.  This will make the
feature easier to explain and make .gitignore files that use it easier
to understand.

I think that 98% of the usefulness of "**" would be in constructs where
it replaces a proper part of the pathname, like "**/SOMETHING" or
"SOMETHING/**/SOMETHING"; in other words, where its use matches the
regexp "(^|/)\*\*/".  In these constructs the only ambiguity is whether
"**/" matches regexp

    "([^/]+/)+"

or

    "([^/]+/)*"

(e.g., whether "foo/**/bar" matches "foo/bar").  I personally prefer the
second, because the first behavior can be had using the second
interpretation by using "SOMETHING/*/**/SOMETHING", whereas the second
behavior cannot be implemented in terms of the first in a single line of
the .gitignore file.

Optionally, one might also like to support "SOMETHING/**" or "**" alone
in the obvious ways.

As for the implementation, it is quite easy to textually convert a glob
pattern, including "**" parts, into a regexp.  I happen to have written
some Python code that does this for another project (see below).  An
obvious optimization would be to read any literal parts of the path off
the beginning of the glob pattern and only use regexps for the tail
part.  Would a regexp-based implementation be too slow?

Michael

_filename_char_pattern = r'[^/]'
_glob_patterns = [
    ('?', _filename_char_pattern),
    ('/**', r'(/.+)?'),
    ('**/', r'(.+/)?'),
    ('*', _filename_char_pattern + r'*'),
    ]

def glob_to_regexp(pattern):
    pattern = os.path.normpath(pattern) # remove trivial redundancies

    if pattern == '**':
        # This case has to be handled separately because it doesn't
        # involve a '/' character adjacent to the '**' pattern.  (Such
        # slashes otherwise have to be considered part of the pattern
        # to handle the matching of zero path components.)
        return re.compile(
            r'^' + _filename_char_pattern + r'(.+' +
_filename_char_pattern + r')?$'
            )

    regexp = [r'^']
    i = 0
    while i < len(pattern):
        for (s, r) in _glob_patterns:
            if pattern.startswith(s, i):
                regexp.append(r)
                i += len(s)
                break
        else:
            # AFAIK it's a normal character.  Escape it and add it to
            # pattern.
            regexp.append(re.escape(pattern[i]))
            i += 1

    regexp.append(r'$')

    return re.compile(''.join(regexp))

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html