Re: Can we make regexp processing more friendly by recognizing "\r\n" as a "newline" for "^$" purposes?

"David G. Johnston" <david.g.johnston@xxxxxxxxx> · Mon, 19 Oct 2015 09:15:05 -0400

On Mon, Oct 19, 2015 at 1:26 AM, Francisco Olarte <folarte@xxxxxxxxxxxxxx> wrote:
Hi David:

On Sun, Oct 18, 2015 at 7:49 PM, David G. Johnston <david.g.johnston@xxxxxxxxx> wrote:

> Other implementations of regular expressions handle "newline" mechanics
> related to "^" and "$" semantically instead of literally.  By that I mean
> that both "\r\n" and "\n" are considered "newlines" instead of just "\n".

Which ones? AFAIK this kind of thing is usually done by C (and related) runtimes when reading text files.

In particular, Java.

There is a third-party program I use, RegEx Buddy, that also behaves in the way described.

At least on my machine, Perl does not do it:

censored:~$ perl -e 'print( ("A\r\n" =~ /A$/) ? "matched\n" : "NO MATCH\n");'
NO MATCH
censored:~$ perl -e 'print( ("A\r\n" =~ /A.$/) ? "matched\n" : "NO MATCH\n");'
matched
censored:~$ perl -e 'print( ("A\r\n" =~ /A\s$/) ? "matched\n" : "NO MATCH\n");'
matched

Yes; and I find this to be an annoyance as well...
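
As another data point, Python's `re` module appears to treat a trailing "\r\n" the same way Perl does in the one-liners above. A minimal sketch mirroring those three checks (Python is used here purely for illustration and was not part of the original test):

```python
import re

# Same subject string as in the Perl one-liners: "A" followed by CR LF.
subject = "A\r\n"

# "$" matches at the very end, or just before a final "\n"; it does not
# match before "\r", so the bare pattern fails.
plain = re.search(r"A$", subject)

# "." and "\s" both consume the "\r", after which "$" can match before "\n".
with_dot = re.search(r"A.$", subject)
with_space = re.search(r"A\s$", subject)

for name, m in [("A$", plain), ("A.$", with_dot), (r"A\s$", with_space)]:
    print(name, "matched" if m else "NO MATCH")
```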

Normally when reading lines in CP/M and related systems (MSDOS, Windows) the CRT does collapse them (and sometimes it just zaps \r, or collapses any run, or considers [\r*]\n[\r*], or ...). But I normally do not see that behaviour in regexes.

> If changing behavior is not desirable I would be content with another flag
> that would toggle such behavior.
>
> In code - both of these subqueries should match whereas presently only the
> first one does.
>
> SELECT regexp_matches(E'123\n',   E'123$', 'w');
> SELECT regexp_matches(E'123\r\n', E'123$', 'w');
>
> I don't know if this is server O/S dependent...but I would not expect it to
> be so.

Neither do I (expect it to be OS dependent), but I find the current behaviour correct. I mean, newline stuff is OS dependent, and you should convert when ingesting data; when matching, it should already have been converted to whatever the language uses for newlines (in C and Perl that means \n, which need not be \012, BTW. In Unix \n=\012 on disk, on CP/M it's \015\012, and when I worked with Mac (before the unixy OS X they use now) it was \015, and I cannot think of what they use on EBCDIC machines).

The current behavior is correct.  The behavior I describe, however, would be more user-friendly without being "incorrect".

Having started with Windows, and still being reliant upon external sources that use it, I've been (un)fortunate to develop habits where 99% of the time I do not have to care about line endings during the processing of data.  I'll pick up new habits eventually, but not having to deal with a pre-process line-ending conversion step would make ad-hoc use of the PostgreSQL regex engine (Tcl's) less cumbersome.
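
For reference, the pre-process conversion step, and the alternative of spelling out an optional "\r" in the pattern, can be sketched as follows. This is an illustration in Python rather than SQL: the helper name `normalize_newlines` is made up, and `re.MULTILINE` only loosely stands in for the 'w' flag used in the queries above; it is a different regex engine.

```python
import re

def normalize_newlines(text: str) -> str:
    """Hypothetical helper: convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

unix_text = "123\n"
dos_text = "123\r\n"

# Baseline: with newline-sensitive "$", only the "\n"-terminated text matches.
baseline_unix = re.search(r"123$", unix_text, re.MULTILINE)
baseline_dos = re.search(r"123$", dos_text, re.MULTILINE)

# Workaround 1: convert line endings first, then match with the same pattern.
after_normalize = re.search(r"123$", normalize_newlines(dos_text), re.MULTILINE)

# Workaround 2: leave the data alone and allow an optional "\r" before "$".
optional_cr = re.search(r"123\r?$", dos_text, re.MULTILINE)

print(bool(baseline_unix), bool(baseline_dos),
      bool(after_normalize), bool(optional_cr))
```

Either workaround has to be remembered for every ad-hoc query, which is the friction a "semantic newline" mode would remove.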

I'm hoping that Tom Lane at least chimes in with his opinion, given that his recent work means that area of the codebase is fresh in his mind.  It's not a huge deal, but recent pain motivates me to at least put it out there.

David J.