Tom, thanks for your reply.

I wrote:
> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+', 'g');
> {p}
>
> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,2}', 'g');
> {@}
> {@}
> {.}
> {p}
>
> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,3}', 'g');
> {foo}
> {.}
> {p}
>
> What's going on here??

FWIW, I have a vague idea of what these do: they match greedily, i.e., exactly as many instances of the subexpression as maximally allowed by the quantifier (and no fewer!), backtracking and regrouping word characters if necessary to get that many instances. Each match then returns only the last of its tuple of subexpression matches, and this repeats until the string is exhausted.

E.g., I think:

'([@.]|[^@.]+){1,2}' matches (quux @ foo @ bar . zi p)
  and returns every second of those: @ @ . p

'([@.]|[^@.]+){1,3}' matches (quux @ foo @ bar . z i p)
  and returns every third of those: foo . p

'([@.]|[^@.]+)+' matches (q u u x @ f o o @ b a r . z i p)
  and returns every 16th of those: p

I see that Perl behaves similarly, except that it does not try to always match exactly as many instances of the subexpression as *maximally* allowed by the quantifier, backtracking if necessary for this to work. That last part is very, very weird.

Tom Lane wrote:
> These might be a bug, but the behavior doesn't seem to me that it'd be
> terribly well defined in any case. The function should be pulling the
> match to the parenthesized subexpression, but here that subexpression
> has got multiple matches --- which one would you expect to get?

I had *hoped* regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)') (without 'g') would return all the subexpression matches as a *single* array in a *single* row.
However, now that I've checked the Perl regexp engine's behavior, I would at least expect it to work just like Perl, i.e., allow fewer matches at the end, without backtracking and regrouping:

$ perl -le 'print(join(" ", "quux\@foo\@bar.zip" =~ m/([@.]|[^@.]+){1,2}/g))'
@ @ . zip
$ perl -le 'print(join(" ", "quux\@foo\@bar.zip" =~ m/([@.]|[^@.]+){1,3}/g))'
foo . zip

> Instead of (foo)+ I'd try
>     ((foo+)) if you want all the matches
>     (foo)(foo)* if you want the first one
>     (?:foo)*(foo) if you want the last one

I would use the ((foo+)) form, but of course it doesn't return all of the subexpression matches as separate elements, which was the point of my exercise.

For what it's worth, I'm now using a "FOR ... IN SELECT regexp_matches(...) LOOP" construct in a custom plpgsql function.

-Julian
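
P.S. A minimal sketch of that kind of loop, for illustration only. The function name regexp_tokens and its signature are hypothetical (this is not the actual function), and the pattern deliberately drops the quantified group so that each token becomes its own match:

```sql
-- Hypothetical helper: collect every match of "pattern" in "input"
-- into a single text[] (one array, one row).
CREATE OR REPLACE FUNCTION regexp_tokens(input text, pattern text)
RETURNS text[] AS $$
DECLARE
    m      text[];          -- one row from regexp_matches()
    result text[] := '{}';  -- accumulated tokens
BEGIN
    -- With the 'g' flag, regexp_matches() returns one row per match.
    FOR m IN SELECT regexp_matches(input, pattern, 'g') LOOP
        -- Assumes the pattern has no parenthesized subexpression,
        -- so m is a one-element array holding the whole match.
        result := array_append(result, m[1]);
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql;

-- Tokenize on '@' and '.', keeping the separators as tokens.
-- Should give {quux,@,foo,@,bar,.,zip}
SELECT regexp_tokens('quux@foo@bar.zip', '[@.]|[^@.]+');
```

The point of the design is to move the repetition out of the regexp and into the 'g' flag: the quantified group only ever reports its last iteration, whereas an unquantified pattern yields one row per token that the loop can append.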