Re: Postgresql GROUP BY "SIMILAR" but not equal values

alexandros_e <alexandros.ef@xxxxxxxxx> · Thu, 6 Feb 2014 08:27:04 -0800 (PST)

Conceptually, Tom (as always) is right. But Alban's hack help.

DROP TABLE foo;
CREATE TABLE IF NOT EXISTS foo(ID INTEGER,G1 INTEGER, T1 TEXT, ID2 SERIAL
PRIMARY KEY);

INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABC');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABCD');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABDC');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEF');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEFH');

/* A little editing to remove duplicates a to b and b to a */
SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1
FROM foo foo1
INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1)
WHERE foo1.ID2<foo2.ID2
GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1
HAVING levenshtein(foo1.T1, foo2.T1) <2;

RETURNS

ID|G1|foo1.T1|foo2.T1
1;2;"ABC";1;"ABCD"
1;2;"ABC";1;"ABDC"
1;2;"ABCD";2;"ABDC"
1;2;"DEF";1;"DEFH"

Then it requires a second grouping but as Tom suggested it would be hard to
somehow group all similar cases together because then it becomes a
clustering problem. With a second grouping we will have 3 records instead of
4, so it is better than the initial case by 25%.   

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Postgresql-GROUP-BY-SIMILAR-but-not-equal-values-tp5790860p5790876.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general