Re: [PATCH 02/18] chainlint.pl: add POSIX shell lexical analyzer

Eric Sunshine <sunshine@xxxxxxxxxxxxxx> · Sat, 3 Sep 2022 02:00:11 -0400

On Thu, Sep 1, 2022 at 8:35 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
> On Thu, Sep 01 2022, Eric Sunshine via GitGitGadget wrote:
> Just generally on this series:
>
> > +     $tag =~ s/['"\\]//g;
>
> I think this would be a *lot* easier to read if all of these little
> regex decls could be split out into some "grammar" class, or other
> helper module/namespace. So e.g.:
>
>         my $SCRIPT_QUOTE_RX = qr/['"\\]/;

Taken out of context (as in the quoted snippet), it may indeed be
difficult to understand what that line is doing, however in context
with a meaningful function name:

    sub scan_heredoc_tag {
        ...
        my $tag = $self->scan_token();
        $tag =~ s/['"\\]//g;
        push(@{$self->{heretags}}, $indented ? "\t$tag" : "$tag");
        ...
    }

for someone who is familiar with common heredoc tag quoting/escaping
(i.e. <<'EOF', <<"EOF", <<\EOF), I find the inline character class
`['"\\]` much easier to understand than some opaque name such as
$SCRIPT_QUOTE_RX, doubly so because the definition of the named regex
might be far removed from the actual code which uses it, which would
require going and studying that definition before being able to
understand what this code is doing.

I grasp you made that name up on-the-fly as an example, but that does
highlight another reason why I'd be hesitant to try to pluck out and
name these regexes. Specifically, naming is hard and I don't trust
that I could come up with succinct meaningful names which would convey
what a regex does as well as the actual regex itself conveys what it
does. In context within the well-named function, `s/['"\\]//g` is
obviously stripping quoting/escaping from the tag name; trying to come
up with a succinct yet accurate name to convey that intention is
difficult. And this is just one example. The script is littered with
little regexes like this, and they are almost all unique, thus making
the task of inventing succinct meaningful names extra difficult. And,
as noted above, I'm not at all convinced that plucking the regex out
of its natural context -- thus making the reader go elsewhere to find
the definition of the regex -- would help improve comprehension.

> Then:
>
> > +     return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;
>
>         my $SCRIPT_WHATEVER_RX = qr/
>                 ^(?:
>                 &&
>                 |
>                 \|\|
>                 [...]
>         /x;
>
> etc., i.e. we could then make use of /x to add inline comments to these.

`/x` does make this slightly easier to grok, and this is a an example
of a regex which might be easy to name (i.e. $TWO_CHAR_OPERATOR), but
-- extra mandatory escaping aside -- it's not hard to understand this
one as-is; it's pretty obvious that it's looking for operators `&&`,
`||`, `>>`, `;;`, `<&`, `>&`, `<>`, and `>|`.