Hi,

On 4/20/23 17:05, наб wrote:
> Hi!
>
> On Thu, Apr 20, 2023 at 04:10:04PM +0200, Alejandro Colomar wrote:
>> On 4/20/23 15:02, наб wrote:
>>> --- a/man3/regex.3
>>> +++ b/man3/regex.3
>>> @@ -188,37 +188,34 @@ This flag is a BSD extension, not present in POSIX.
>>> .SS Match offsets
>>> Unless
>>> .B REG_NOSUB
>>> -was set for the compilation of the pattern buffer, it is possible to
>>> -obtain match addressing information.
>>> -.I pmatch
>>> -must be dimensioned to have at least
>>> -.I nmatch
>>> -elements.
>>> -These are filled in by
>>> +was passed to
>>> +.BR regcomp (),
>>> +it is possible to
>>> +obtain the locations of matches within
>>> +.IR string :
>>> .BR regexec ()
>>> -with substring match addresses.
>>> -The offsets of the subexpression starting at the
>>> -.IR i th
>>> -open parenthesis are stored in
>>> -.IR pmatch[i] .
>>> -The entire regular expression's match addresses are stored in
>>> -.IR pmatch[0] .
>>> -(Note that to return the offsets of
>>> -.I N
>>> -subexpression matches,
>>> +fills
>>> .I nmatch
>>> -must be at least
>>> -.IR N+1 .)
>>> -Any unused structure elements will contain the value \-1.
>>> +elements of
>>> +.I pmatch
>>> +with results:
>>> +.I pmatch[0]
>>> +corresponds to the entire match,
>> I still don't understand this. Does REG_NOSUB also affect pmatch[0]?
>> I would have expected that it would only affect *sub*matches, that is, [>0].
>
> Let's consult the manual:
>   REG_NOSUB  Do not report position of matches. [...]
>   REG_NOSUB  Compile for matching that need only report success or
>              failure, not what was matched. (4.4BSD)
> and POSIX:
>   REG_NOSUB  Report only success or fail in regexec().
>   REG_NOSUB  Report only success/fail in regexec( ).
> (yes; the two times it describes it, it's written differently).
>
> POSIX says it better I think.
>
> And, indeed:
>   $ cat a.c
>   #include <regex.h>
>   #include <stdio.h>
>   int main(int c, char ** v) {
>     regex_t r;
>     regcomp(&r, v[1], 0);
>     regmatch_t dt = {0, 3};
>     printf("%d\n", regexec(&r, v[2], 1, &dt, REG_STARTEND));
>     printf("%d, %d\n", (int)dt.rm_so, (int)dt.rm_eo);
>   }
>
>   $ cc a.c -oac
>   $ ./ac 'c$' 'abcdef'
>   0
>   2, 3
>
>   $ sed 's/0)/REG_NOSUB)/' a.c | cc -xc - -oac
>   $ ./ac 'c$' 'abcdef'
>   0
>   0, 3
>

I like this example, and the quotes from POSIX. I'll link to your
message in the commit log.

>
> ...and I've just realised why you're asking ‒ I think you're reading too
> much (and ahistorically) into the "SUB" bit;

[...]

> Actually, let's consult POSIX.2 (Draft 11.2):

[...]

>   609  If the REG_NOSUB flag was not set in cflags, then regcomp() shall set re_nsub to
>   610  the number of parenthesized subexpressions [delimited by \( \) in basic regular
>   611  expressions or ( ) in extended regular expressions] found in pattern.
> both as present-day.

[...]

>        It also allows an application to request an arbitrary number of sub-
>   810  strings from a regular expression. (Previous versions reported only ten sub-
>   811  strings.) The number of subexpressions in the regular expression is reported in
>   812  re_nsub in preg.

[...]

>
> So: yes, there was a substitution interface that got cut.
> The name is actually a hold-over from
>   "don't allocate for ten subexpressions in regex_t".

So, the name indeed seems to come from "subexpressions", which confirms
that it's just confusing as hell.

>
> I think changing our description to
>   REG_NOSUB  Only report overall success. regexec() will only use pmatch
>              for REG_STARTEND, and ignore nmatch.
> may make that more obvious.

Yeah, this, and further the version in v8, makes the behavior clear, even
if the name is brain-damaged (but there's nothing we can do about it :/).

Cheers,
Alex

>
> Best,
> наб

--
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5