Hi!

On Thu, Apr 20, 2023 at 04:10:04PM +0200, Alejandro Colomar wrote:
> On 4/20/23 15:02, наб wrote:
> > --- a/man3/regex.3
> > +++ b/man3/regex.3
> > @@ -188,37 +188,34 @@ This flag is a BSD extension, not present in POSIX.
> >  .SS Match offsets
> >  Unless
> >  .B REG_NOSUB
> > -was set for the compilation of the pattern buffer, it is possible to
> > -obtain match addressing information.
> > -.I pmatch
> > -must be dimensioned to have at least
> > -.I nmatch
> > -elements.
> > -These are filled in by
> > +was passed to
> > +.BR regcomp (),
> > +it is possible to
> > +obtain the locations of matches within
> > +.IR string :
> >  .BR regexec ()
> > -with substring match addresses.
> > -The offsets of the subexpression starting at the
> > -.IR i th
> > -open parenthesis are stored in
> > -.IR pmatch[i] .
> > -The entire regular expression's match addresses are stored in
> > -.IR pmatch[0] .
> > -(Note that to return the offsets of
> > -.I N
> > -subexpression matches,
> > +fills
> >  .I nmatch
> > -must be at least
> > -.IR N+1 .)
> > -Any unused structure elements will contain the value \-1.
> > +elements of
> > +.I pmatch
> > +with results:
> > +.I pmatch[0]
> > +corresponds to the entire match,
> I still don't understand this.  Does REG_NOSUB also affect pmatch[0]?
> I would have expected that it would only affect *sub*matches, that is, [>0].

Let's consult the manual:
  REG_NOSUB
      Do not report position of matches. [...]

  REG_NOSUB
      Compile for matching that need only report success or failure,
      not what was matched.
(4.4BSD) and POSIX:
  REG_NOSUB  Report only success or fail in regexec().
  REG_NOSUB  Report only success/fail in regexec( ).
(yes; the two times it describes it, it's written differently).
POSIX says it better I think.

And, indeed:
  $ cat a.c
  #include <regex.h>
  #include <stdio.h>
  int main(int c, char ** v) {
      regex_t r;
      regcomp(&r, v[1], 0);
      regmatch_t dt = {0, 3};
      printf("%d\n", regexec(&r, v[2], 1, &dt, REG_STARTEND));
      printf("%d, %d\n", (int)dt.rm_so, (int)dt.rm_eo);
  }
  $ cc a.c -oac
  $ ./ac 'c$' 'abcdef'
  0
  2, 3
  $ sed 's/0)/REG_NOSUB)/' a.c | cc -xc - -oac
  $ ./ac 'c$' 'abcdef'
  0
  0, 3

...and I've just realised why you're asking ‒ I think you're reading
too much (and ahistorically) into the "SUB" bit;
heretofore I've assumed this is for "substitution", which I think is fair.

Actually, let's consult POSIX.2 (Draft 11.2):
  591  Table B-8 − regcomp() cflags Argument
  596  REG_NOSUB    Report only success/fail in regexec().

B.5 C Binding for Regular Expression Matching, B.5.2 Description:
  609  If the REG_NOSUB flag was not set in cflags, then regcomp() shall set re_nsub to
  610  the number of parenthesized subexpressions [delimited by \( \) in basic regular
  611  expressions or ( ) in extended regular expressions] found in pattern.
both as present-day.

B.5.5 Rationale., History of Decisions Made:
  791  The working group has rejected, at least for now, the inclusion of a regsub() func-
  792  tion that would be used to do substitutions for a matched regular expression.
  793  While such a routine would be useful to some applications, its utility would be
  794  much more limited than the matching function described here. Both regular
  795  expression parsing and substitution are possible to implement without support
  796  other than that required by the C Standard {7}, but matching is much more com-
  797  plex than substituting. The only ‘‘difficult’’ part of substitution, given the infor-
  798  mation supplied by regexec(), is finding the next character in a string when there
  799  can be multibyte characters. That is a much wider issue, and one that needs a
  800  more general solution.

  803  In Draft 9, the interface was modified so that the matched substrings rm_sp and
  804  rm_ep are in a separate regmatch_t structure instead of in regex_t. This allows a
  805  single compiled regular expression to be used simultaneously in several contexts;
  806  in main() and a signal handler, perhaps, or in multiple threads of lightweight
  807  processes. (The preg argument to regexec() is declared with type const, so the
  808  implementation is not permitted to use the structure to store intermediate
  809  results.) It also allows an application to request an arbitrary number of sub-
  810  strings from a regular expression. (Previous versions reported only ten sub-
  811  strings.) The number of subexpressions in the regular expression is reported in
  812  re_nsub in preg. With this change to regexec(), consideration was given to drop-
  813  ping the REG_NOSUB flag, since the user can now specify this with a zero nmatch
  814  argument to regexec(). However, keeping REG_NOSUB allows an implementation
  815  to use a different (perhaps more efficient) algorithm if it knows in regcomp() that
  816  no subexpressions need be reported. The implementation is only required to fill
  817  in pmatch if nmatch is not zero and if REG_NOSUB is not specified. Note that the
  818  size_t type, as defined in the C Standard {7}, is unsigned, so the description of
  819  regexec() does not need to address negative values of nmatch.

So: yes, there was a substitution interface that got cut.
The name is actually a hold-over from
"don't allocate for ten subexpressions in regex_t".

I think changing our description to
  REG_NOSUB
      Only report overall success.  regexec() will only use pmatch
      for REG_STARTEND, and ignore nmatch.
may make that more obvious.

Best,
наб