Re: Regexp_replace bug / does not terminate on long strings

Tom Lane <tgl@xxxxxxxxxxxxx> · Thu, 19 Aug 2021 18:42:35 -0400

"Markhof, Ingolf" <ingolf.markhof@xxxxxxxxxxxxxx> writes:
> BRIEF:
> regexp_replace(source,pattern,replacement,flags) needs very (!) long to
> complete or does not complete at all (?!) for big input strings (a few k
> characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses.  I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern.  In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word.  Or it could be N-1 repetitions followed by one other
word, and so on.  The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest.  Usually the actual match (if any) will be pretty
short, so you do O(N) wasted work per word; over a string of N words
that makes the runtime at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time.  Get rid of one dup at a time, say by
     str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.
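
A minimal sketch of that loop as a PL/pgSQL function follows.  The
function name dedup_adjacent and its variable names are made up for
illustration; only the regexp_replace call is taken from above.  It
assumes standard_conforming_strings is on (the default), so that \1 and
\3 reach the regex engine unchanged, and it is marked STRICT so that a
NULL input returns NULL instead of looping forever.

```sql
-- Hypothetical helper, not part of PostgreSQL: collapses runs of
-- adjacent duplicate words in a comma-separated string.
CREATE OR REPLACE FUNCTION dedup_adjacent(str text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
    result   text := str;
    prev_len integer;
BEGIN
    LOOP
        prev_len := length(result);
        -- One pass: each match removes at most one adjacent repeat.
        result := regexp_replace(result, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
        -- Stop as soon as a pass no longer shortens the string.
        EXIT WHEN length(result) >= prev_len;
    END LOOP;
    RETURN result;
END;
$$;

-- Usage: 'a,a,a,b,b,c' becomes 'a,a,b,c' after the first pass and
-- 'a,b,c' after the second; the third pass changes nothing.
SELECT dedup_adjacent('a,a,a,b,b,c');   -- returns 'a,b,c'
```

Each pass roughly halves any run of identical words, so the number of
passes should grow only with the logarithm of the longest run, and each
pass avoids the (,\1)* subpattern that causes the slowdown.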

I did come across a performance bug [1] while poking at this, but
alas fixing it doesn't move the needle very much for this example.

			regards, tom lane

[1] https://www.postgresql.org/message-id/1808998.1629412269%40sss.pgh.pa.us