Re: [PATCH v5 4/8] regex.3: Improve REG_STARTEND

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Apr 21, 2023 at 01:01:11AM +0200, Alejandro Colomar wrote:
> On 4/20/23 21:30, наб wrote:
> > On Thu, Apr 20, 2023 at 07:29:27PM +0200, Alejandro Colomar wrote:
> >> On 4/20/23 17:35, наб wrote:
> >>> --- a/man3/regex.3
> >>> +++ b/man3/regex.3
> >>> @@ -131,23 +131,30 @@ .SS Matching
> >>>  above).
> >>>  .TP
> >>>  .B REG_STARTEND
> >>> -Use
> >>> -.I pmatch[0]
> >>> -on the input string, starting at byte
> >>> -.I pmatch[0].rm_so
> >>> -and ending before byte
> >>> -.IR pmatch[0].rm_eo .
> >>> +Match
> >>> +.RI [ string " + " pmatch[0].rm_so ", " string " + " pmatch[0].rm_eo )
> >>> +instead of
> >>> +.RI [ string ", " string " + \fBstrlen\fP(" string )).
> >>>  This allows matching embedded NUL bytes
> >>>  and avoids a
> >>>  .BR strlen (3)
> >>> -on large strings.
> >>> -It does not use
> >>> +on known-length strings.
> >>> +.I pmatch
> >>> +must point to a valid readable object.
> >> I think this is redundant, since we showed that [0] is accessed by
> >> the function.
> > Yeah.
> > 
> >>> +If any matches are returned
> >>> +.RB ( REG_NOSUB
> >>> +wasn't passed to
> >>> +.BR regcomp (),
> >>> +the match succeeded, and
> >>>  .I nmatch
> >>> -on input, and does not change
> >>> -.B REG_NOTBOL
> >>> -or
> >>> -.B REG_NEWLINE
> >>> -processing.
> >>> +> 0), they overwrite
> >> And of course, nmatch must be at least 1, since otherwise, [0] was
> >> not valid, and the whole call would have been UB; right?  So that
> >> third condition must be true to not invoke UB, so we can omit it too,
> >> I think.
> > What? idk where you got this from.
> > Per 0d120a3c76b4446b194a54387ce0e7a84b208bfd:
> >     In the regexec() signature
> >       regmatch_t pmatch[restrict .nmatch],
> >     is a simplification. It's actually
> >       regmatch_t pmatch[restrict
> >         ((.preg->flags & REG_NOSUB) ? 0 : .nmatch) ?:
> >          !!(.eflags & REG_STARTEND)],
> That is a model that was useful in a commit message to describe more
> or less what happens.  It doesn't need to perfectly describe reality.
> Since REG_STARTEND is not in POSIX, we can't read what POSIX says,
> so it's all up to how much implementations want to guarantee.  I
> don't think glibc would like to allow specifying .nmatch as 0 while
> the function accesses [0].  The fact that the current implementation
> doesn't open Hell's doors to nasal demons doesn't mean it can't do
> so in the future.  I conceive that _FORTIFY_SOURCE could reasonably
> check that pmatch[] has at least .nmemb elements, and I don't want
> to preclude that in the documentation.
What? I don't get this. Who cares what POSIX says about this 4.4BSD
extension?

This interface has been unchanged for over 30 years;
4.4BSD-Lite, /usr/src/lib/libc/regex/regexec.c:
    *      @(#)regexec.c   8.1 (Berkeley) 6/4/93

    int                             /* 0 success, REG_NOMATCH failure */
    regexec(preg, string, nmatch, pmatch, eflags)
    const regex_t *preg;
    const char *string;
    size_t nmatch;
    regmatch_t pmatch[];
    int eflags;
    {
            register struct re_guts *g = preg->re_g;
    #ifdef REDEBUG
    #       define  GOODFLAGS(f)    (f)
    #else
    #       define  GOODFLAGS(f)    ((f)&(REG_NOTBOL|REG_NOTEOL|REG_STARTEND))
    #endif
    
            if (preg->re_magic != MAGIC1 || g->magic != MAGIC2)
                    return(REG_BADPAT);
            assert(!(g->iflags&BAD));
            if (g->iflags&BAD)              /* backstop for no-debug case */
                    return(REG_BADPAT);
            if (eflags != GOODFLAGS(eflags))
                    return(REG_INVARG);
    
            if (g->nstates <= CHAR_BIT*sizeof(states1) && !(eflags&REG_LARGE))
                    return(smatcher(g, (char *)string, nmatch, pmatch, eflags));
            else
                    return(lmatcher(g, (char *)string, nmatch, pmatch, eflags));
    }

4.4BSD-Lite, /usr/src/lib/libc/regex/engine.c:
    *      @(#)engine.c    8.1 (Berkeley) 6/4/93

    /*
     * The matching engine and friends.  This file is #included by regexec.c
     * after suitable #defines of a variety of macros used herein, so that
     * different state representations can be used without duplicating masses
     * of code.
     */

    #ifdef SNAMES
    #define matcher smatcher
    
    #ifdef LNAMES
    #define matcher lmatcher
    
    /*
     - matcher - the actual matching engine
     == static int matcher(register struct re_guts *g, char *string, \
     ==     size_t nmatch, regmatch_t pmatch[], int eflags);
     */
    static int                      /* 0 success, REG_NOMATCH failure */
    matcher(g, string, nmatch, pmatch, eflags)
    register struct re_guts *g;
    char *string;
    size_t nmatch;
    regmatch_t pmatch[];
    int eflags;
    {
            register char *endp;
            register int i;
            struct match mv;
            register struct match *m = &mv;
            register char *dp;
            const register sopno gf = g->firststate+1;      /* +1 for OEND */
            const register sopno gl = g->laststate;
            char *start;
            char *stop;
    
            /* simplify the situation where possible */
            if (g->cflags&REG_NOSUB)
                    nmatch = 0;
            if (eflags&REG_STARTEND) {
                    start = string + pmatch[0].rm_so;
                    stop = string + pmatch[0].rm_eo;
            } else {
                    start = string;
                    stop = start + strlen(start);
            }
            if (stop < start)
                    return(REG_INVARG);
(rest of matcher)
            /* fill in the details if requested */
            if (nmatch > 0) {
                    pmatch[0].rm_so = m->coldp - m->offp;
                    pmatch[0].rm_eo = endp - m->offp;
            }
            if (nmatch > 1) {
                    assert(m->pmatch != NULL);
                    for (i = 1; i < nmatch; i++)
                            if (i <= m->g->nsub)
                                    pmatch[i] = m->pmatch[i];
                            else {
                                    pmatch[i].rm_so = -1;
                                    pmatch[i].rm_eo = -1;
                            }
            }

That's what the interface /is/ (also, I was guessing last time from
 behaviour and wrote the exact same pseudocode; fun).

And, tell you what, musl also does if(REG_NOSUB) nmatch = 0;
so does the illumos gate; glibc does
    int
    regexec (const regex_t *__restrict preg, const char *__restrict string,
             size_t nmatch, regmatch_t pmatch[_REGEX_NELTS (nmatch)], int eflags)
    {
      reg_errcode_t err;
      Idx start, length;
      re_dfa_t *dfa = preg->buffer;
    
      if (eflags & ~(REG_NOTBOL | REG_NOTEOL | REG_STARTEND))
        return REG_BADPAT;
    
      if (eflags & REG_STARTEND)
        {
          start = pmatch[0].rm_so;
          length = pmatch[0].rm_eo;
        }
      else
        {
          start = 0;
          length = strlen (string);
        }
    
      lock_lock (dfa->lock);
      if (preg->no_sub)
        err = re_search_internal (preg, string, length, start, length,
                                  length, 0, NULL, eflags);
      else
        err = re_search_internal (preg, string, length, start, length,
                                  length, nmatch, pmatch, eflags);
      lock_unlock (dfa->lock);
      return err != REG_NOERROR;
    }
i.e. it sets nmatch to 0 if REG_NOSUB, but later.
None of them do
  if (eflags & REG_STARTEND && !nmatch)
    ... what now? return an error?
for the sole purpose of... providing an interface that's broken?

nmatch is the amount of matches you care about getting back,
and nothing more.


If anything, the POSIX header is (Issue 8 Draft 2.1):
11030  The following shall be declared as functions and may also be defined as macros. Function
11031  prototypes shall be provided.
11032  int    regcomp(regex_t *restrict, const char *restrict, int);
11033  size_t regerror(int, const regex_t *restrict, char *restrict, size_t);
11034  int    regexec(const regex_t *restrict, const char *restrict, size_t,
11035            regmatch_t [restrict], int);
11036  void   regfree(regex_t *);


So you've overconstrained the interface for simplicity,
and now you're treating the simplification as a ground truth of..?

And glibc, if anything, would love for you to specify the start and end
bounds with REG_STARTEND while also passing nmatch = 0,
because it additionally optimises for that case (&& no backrefs).


/And also/, 6.7.6.2 Array declarators says:
  Constraints
  1 In addition to optional type qualifiers and the keyword static, the
  [ and ] may delimit an expression or *. If they delimit an expression
  (which specifies the size of an array), the expression shallhave an
  integer type. If the expression is a constant expression, it shall
  have a value greater thanzero. The element type shall not be an
  incomplete or function type. The optional type qualifiers and the
  keyword static shall appear only in a declaration of a function
  parameter with an array type, and then only in the outer most array
  type derivation.

  2 If an identifier is declared as having a variably modified type, it
  shall be an ordinary identifier (as defined in 6.2.3), have no
  linkage, and have either block scope or function prototype scope. If
  an identifier is declared to be an object with static or thread
  storage duration, it shall not have a variable length array type.

  Semantics
  3 If, in the declaration "T D1", D1 has one of the forms:
     D [        type-qualifier-list(opt)        assignment-expression(opt) ] attribute-specifier-sequence(opt)
     D [ static type-qualifier-list(opt)        assignment-expression      ] attribute-specifier-sequence(opt)
     D [        type-qualifier-list      static assignment-expression      ] attribute-specifier-sequence(opt)
     D [        type-qualifier-list(opt) *                                 ] attribute-specifier-sequence(opt)
  and the type specified for /ident/ in the declaration "T D" is
  "derived-declarator-type-list T", then the type specified for /ident/
  is "derived-declarator-type-list array of T".172)173) The optional
  attribute specifiersequence appertains to the array. (See 6.7.6.3 for
  the meaning of the optional type qualifiers and the keyword static.)

Where 6.7.6.3 Function declarators says:
  6 A declaration of a parameter as "array of /type/" shall be adjusted
  to "qualified pointer to /type/", wherethe /type/ qualifiers (if any)
  are those specified within the [ and ] of the array type derivation.
  If the keyword static also appears within the [ and ] of the array
  type derivation, then for each call to the function, the value of the
  corresponding actual argument shall provide access to the first
  element of an array with at least as many elements as specified by the
  size expression.

So even /if/ the declaration was
  int
  regexec(const regex_t *restrict, const char *restrict, size_t nmatch,
          regmatch_t pmatch[restrict static nmatch], int);
/which it isn't/, not even in glibc (#define _REGEX_NELTS(n) n, or to
empty depending on the environment, which means it's a regular
variably-modified array type, which means nothing when used in a
function prototype), it would /still/ be legal to do any of the below:
  regexec(regp, "", 0, NULL, 0);
  regmatch_t rm;
  regexec(regp, "", 0, &rm, 0);
  regexec(regp, "", 1, &rm, 0);
  regmatch_t rms[999];
  for(int i = 0; i < 999; ++i)
    regexec(regp, "", i, rms, 0);

More to the point, perhaps, 6.7.6.3 continues:
  20 EXAMPLE 5
  The following are all compatible function prototype declarators.
    double maximum(int n, int m, double a[n][m]);
    double maximum(int n, int m, double a[*][*]);
    double maximum(int n, int m, double a[ ][*]);
    double maximum(int n, int m, double a[ ][m]);
  as are:
    void f(double (*restrict a)[5]);
    void f(double a[restrict][5]);
    void f(double a[restrict 3][5]);
    void f(double a[restrict static 3][5]);
  (Note that the last declaration also specifies that the argument
  corresponding toain any call tofcan be expected to be a non-null
  pointer to the first of at least three arrays of 5 doubles,
  which the others do not.)

Which the others do not.

Well, it's not to the point since there's no static and there'll never
be static, but maybe it drives home that unless whatever's inside [] is
"restrict" or "static {expr}", it's purely decorative. And even with
static, you can always give it more objects. This is like saying that
       char *strncpy(char dst[restrict .sz], const char *restrict src,
                      size_t sz);
makes
    char dst[256 + 1]
	strncpy(dst, "whatever", 256);
illegal.


(There's also a forward-reffed stanza at 6.9.1.10, but I'm pretty sure
 it only applies to multi-dimensional VLAs.)


Best,
наб

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Kernel Documentation]     [Netdev]     [Linux Ethernet Bridging]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux