Re: [PATCH v4 6/6] regex.3: Destandardeseify Match offsets

Hi!

On Thu, Apr 20, 2023 at 04:10:04PM +0200, Alejandro Colomar wrote:
> On 4/20/23 15:02, наб wrote:
> > --- a/man3/regex.3
> > +++ b/man3/regex.3
> > @@ -188,37 +188,34 @@ This flag is a BSD extension, not present in POSIX.
> >  .SS Match offsets
> >  Unless
> >  .B REG_NOSUB
> > -was set for the compilation of the pattern buffer, it is possible to
> > -obtain match addressing information.
> > -.I pmatch
> > -must be dimensioned to have at least
> > -.I nmatch
> > -elements.
> > -These are filled in by
> > +was passed to
> > +.BR regcomp (),
> > +it is possible to
> > +obtain the locations of matches within
> > +.IR string :
> >  .BR regexec ()
> > -with substring match addresses.
> > -The offsets of the subexpression starting at the
> > -.IR i th
> > -open parenthesis are stored in
> > -.IR pmatch[i] .
> > -The entire regular expression's match addresses are stored in
> > -.IR pmatch[0] .
> > -(Note that to return the offsets of
> > -.I N
> > -subexpression matches,
> > +fills
> >  .I nmatch
> > -must be at least
> > -.IR N+1 .)
> > -Any unused structure elements will contain the value \-1.
> > +elements of
> > +.I pmatch
> > +with results:
> > +.I pmatch[0]
> > +corresponds to the entire match,
> I still don't understand this.  Does REG_NOSUB also affect pmatch[0]?
> I would have expected that it would only affect *sub*matches, that is, [>0].

Let's consult the manual:
  REG_NOSUB  Do not report position of matches. [...]
  REG_NOSUB  Compile for matching that need only report success or
             failure, not what was matched.                    (4.4BSD)
and POSIX:
  REG_NOSUB  Report only success or fail in regexec().
  REG_NOSUB  Report only success/fail in regexec( ).
(yes; the two times it describes it, it's written differently).

POSIX says it better, I think.

And, indeed:
	$ cat a.c
	#include <regex.h>
	#include <stdio.h>
	int main(int c, char ** v) {
		regex_t r;
		regcomp(&r, v[1], 0);
		regmatch_t dt = {0, 3};
		printf("%d\n", regexec(&r, v[2], 1, &dt, REG_STARTEND));
		printf("%d, %d\n", (int)dt.rm_so, (int)dt.rm_eo);
	}

	$ cc a.c -oac
	$ ./ac 'c$' 'abcdef'
	0
	2, 3

	$ sed 's/0)/REG_NOSUB)/' a.c | cc -xc - -oac
	$ ./ac 'c$' 'abcdef'
	0
	0, 3


...and I've just realised why you're asking: I think you're reading too
much (and ahistorically) into the "SUB" bit;
heretofore I'd assumed it stood for "substitution", which I think is fair.

Actually, let's consult POSIX.2 (Draft 11.2):
  591     Table B-8  − regcomp() cflags Argument
  596  REG_NOSUB  Report only success/fail in regexec().
B.5 C Binding for Regular Expression Matching, B.5.2 Description:
  609  If the REG_NOSUB flag was not set in cflags, then regcomp() shall set re_nsub to
  610  the number of parenthesized subexpressions [delimited by \( \) in basic regular
  611  expressions or ( ) in extended regular expressions] found in pattern.
both the same as in present-day POSIX.

B.5.5 Rationale, History of Decisions Made:
  791  The working group has rejected, at least for now, the inclusion of a regsub() func-
  792  tion that would be used to do substitutions for a matched regular expression.
  793  While such a routine would be useful to some applications, its utility would be
  794  much more limited than the matching function described here. Both regular
  795  expression parsing and substitution are possible to implement without support
  796  other than that required by the C Standard {7}, but matching is much more com-
  797  plex than substituting. The only ‘‘difficult’’ part of substitution, given the infor-
  798  mation supplied by regexec(), is finding the next character in a string when there
  799  can be multibyte characters. That is a much wider issue, and one that needs a
  800  more general solution.

  803  In Draft 9, the interface was modified so that the matched substrings rm_sp and
  804  rm_ep are in a separate regmatch_t structure instead of in regex_t. This allows a
  805  single compiled regular expression to be used simultaneously in several contexts;
  806  in main() and a signal handler, perhaps, or in multiple threads of lightweight
  807  processes. (The preg argument to regexec() is declared with type const, so the
  808  implementation is not permitted to use the structure to store intermediate
  809  results.) It also allows an application to request an arbitrary number of sub-
  810  strings from a regular expression. (Previous versions reported only ten sub-
  811  strings.) The number of subexpressions in the regular expression is reported in
  812  re_nsub in preg. With this change to regexec(), consideration was given to drop-
  813  ping the REG_NOSUB flag, since the user can now specify this with a zero nmatch
  814  argument to regexec(). However, keeping REG_NOSUB allows an implementation
  815  to use a different (perhaps more efficient) algorithm if it knows in regcomp() that
  816  no subexpressions need be reported. The implementation is only required to fill
  817  in pmatch if nmatch is not zero and if REG_NOSUB is not specified. Note that the
  818  size_t type, as defined in the C Standard {7}, is unsigned, so the description of
  819  regexec() does not need to address negative values of nmatch.

So: yes, there was a substitution interface that got cut.
The name is actually a hold-over from
"don't allocate for ten subexpressions in regex_t".

I think changing our description to
  REG_NOSUB  Only report overall success. regexec() will only use pmatch
             for REG_STARTEND, and ignore nmatch.
may make that more obvious.

Best,
наб


