Re: efficient way to do "fuzzy" join

Rémi Cura <remi.cura@xxxxxxxxx> · Fri, 11 Apr 2014 20:18:41 +0200

Wow many thanks!

I had thought about the order by and limit because it is the natural way to express the problem,
but I had discarded it for fear of suchbad complexity 

(theoretically, for each row of B , compute the distance to every other row of A!)
.

And it's okay if 2 row from B share the same join to row from A, because when interpolating it will be different.

Here is the test env with realistic number, your solution is very fast, I have to raise my hat (4 sec!)
-------------------------------------------------------

--usefull function to fill with random text

    CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
    RETURNS text AS $$
    SELECT array_to_string(
    ARRAY(
    SELECT
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)

    FROM generate_series(1,$1)
    )
    ,'')
    $$ LANGUAGE sql;

--creating tables
    DROP TABLE IF EXISTS a;
    DROP TABLE IF EXISTS b;
    create table a(t float, data text);

    create table b(t float, data text);
    CREATE INDEX ON a (t);
    CREATE INDEX ON b (t);

--filling tables
    WITH the_serie AS (
        SELECT  s+random()/2 AS s, rc_random_string(100) aS data

        FROM generate_series(1,100000) AS s

    )
    insert into a SELECT s, data
    FROM the_serie;

    WITH the_serie AS (
        SELECT  s+(random()-0.5)*2 AS s, rc_random_string(100) aS data

        FROM generate_series(1,100000) AS s

    )
    insert into b SELECT s, data
    FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing result 
DROP TABLE IF EXISTS t;
CREATE TABLE t AS 

    select a.t As a_t, b.t as b_t
    from (
      select t, least( least(t, mint), least(t, maxt)) as t2 from (
        select t,
         (select t from a where a.t >= b.t order by a.t limit 1) as mint,

         (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
      from b
      ) as tmp
    ) as tmp2
    inner join a on (tmp2.t2 = a.t)
    inner join b on (tmp2.t = b.t)
--------------------------------------------------------

2014-04-11 19:16 GMT+02:00 Andy Colson <andy@xxxxxxxxxxxxxxx>:

On 4/11/2014 7:50 AM, Rémi Cura wrote:

Hey dear List,

I'm looking for some advice about the best way to perform a "fuzzy"

join, that is joining two table based on approximate matching.

It is about temporal matching

given a table A with rows containing data and a control_time (for

instance 1 ; 5; 6;  .. sec, not necessarly rounded of evenly-spaced)

given another table B with lines on no precise timing (eg control_time =

2.3 ; 5.8 ; 6.2 for example)

How to join every row of B to A based on

min(@(A.control_time-B.control_time))

(that is, for every row of B, get the row of A that is temporaly the

closest),

in an efficient way?

(to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)

Optionnaly, how to get interpolation efficiently (meaning one has to get

the previous time and next time for 1 st order interpolation, 2 before

and 2 after for 2nd order interpolation, and so on)?

(to be explicit 5.8 would match to 5 and 6, the weight being 0.2 and 0.8

respectively)

Currently my data is spatial so I use Postgis function to interpolate a

point on a line, but is is far from efficient or general, and I don't

have control on interpolation (only the spatial values are interpolated).

Cheers,

Rémi-C

Ok, here is a just sql way.  No ranges.  No idea if its right.  A first pass, so to speak.

create table a(t float, data text);

create table b(t float, data text);

insert into a values (1), (5), (6);

insert into b values (2.3), (5.8), (6.2);

select a.t, b.t

from (

  select t, least( least(t, mint), least(t, maxt)) as t2 from (

    select t,

     (select t from a where a.t >= b.t order by a.t limit 1) as mint,

     (select t from a where a.t < b.t order by a.t desc limit 1) as maxt

  from b

  ) as tmp

) as tmp2

inner join a on (tmp2.t2 = a.t)

inner join b on (tmp2.t = b.t)

The middle part is the magic:

select t,

 (select t from a where a.t >= b.t order by a.t limit 1) as mint,

 (select t from a where a.t < b.t order by a.t desc limit 1) as maxt

from b

The rest is just to make it usable.  If t is indexed, it'll probably be fast too.

-Andy