Re: back references using regex

Michael Fuhr <mike@xxxxxxxx> · Thu, 8 Sep 2005 16:54:59 -0600

On Thu, Sep 08, 2005 at 01:52:35PM -0700, Matthew Peter wrote:
> It's not a complex regex as I have wrote one that does
> what I want, yet not at the database level. The docs
> didn't help clarify anything. I'm still not clear on
> how it determines where the back reference comes from
> in the previous example you gave. And why digits
> wouldn't work. 

Back references work as they usually do in regular expressions:
they refer to the matched value of a previous parenthesized
subexpression.  If you have multiple open parentheses then you
need to refer to the correct subexpression to get what you want.
Example:

CREATE TABLE foo (t text);
INSERT INTO foo VALUES ('abc.foo.foo.xyz');
INSERT INTO foo VALUES ('12.00.00.34');
INSERT INTO foo VALUES ('abc.def.ghi');

SELECT t FROM foo WHERE t ~ $$(\w+)\.\1$$;
        t        
-----------------
 abc.foo.foo.xyz
 12.00.00.34
(2 rows)

In the above query the regular expression has only one set of
parentheses, so the back reference refers to \1.  The result set
contains the two rows that have one or more word characters followed
by a dot followed by the same set of characters.

In the following query, note the difference between the regular
expression in the select list and the one in the where clause:

SELECT substring(t FROM $$((\w+)\.\2)$$)
FROM foo
WHERE t ~ $$(\w+)\.\1$$;
 substring 
-----------
 foo.foo
 00.00
(2 rows)

In the regular expression in the select list, we use the outermost
set of parentheses for grouping, so the back reference needs to
refer to the subexpression that begins with the second open
parenthesis (i.e., we must use \2).  In the regular expression
in the where clause, we have only one set of parentheses so the
back reference is \1.

Regarding digits, you didn't post any output in your example, so
we don't know if it really doesn't work or if it just doesn't do
what you were expecting.  Here's what I get from your examples:

SELECT substring('12.00.00.34' FROM $$((\d+)\.\2)$$);
 substring 
-----------
 00.00
(1 row)

SELECT substring('12.00.00.34' FROM $$((\w+)\.\2)$$);
 substring 
-----------
 00.00
(1 row)

SELECT substring('abc.foo.foo.xyz' FROM $$((\w+)\.\2)$$);
 substring 
-----------
 foo.foo
(1 row)

Do you get different results, or do these results surprise you?
They all appear to be correct.

> What I basically want to do is have a slice function
> like Python, where I can slice out items from a \s, \.
> or \n\n separated list. Where I'll just change the
> delimiter for the query that it applies. 
> 
> Where I could start it at a certain point and end it
> at another. Like slicing out paragraph 3-6 (delimiter
> \n\n) or the 2nd-6th sentence in a article (delimiter
> \.). That is what I am trying to do.

You can use split_part() to get a single item or string_to_array()
to build an array from which you can extract multiple items.

CREATE TABLE foo (t text);
INSERT INTO foo VALUES ('one.two.three.four.five.six.');

SELECT (string_to_array(t, '.'))[3:5] FROM foo;
  string_to_array  
-------------------
 {three,four,five}
(1 row)

SELECT array_to_string((string_to_array(t, '.'))[3:5], '.') FROM foo;
 array_to_string 
-----------------
 three.four.five
(1 row)

Is that what you're looking for?

-- 
Michael Fuhr

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster