Re: efficient way to do "fuzzy" join

Andy Colson <andy@xxxxxxxxxxxxxxx> · Fri, 11 Apr 2014 13:12:15 -0500

On 4/11/2014 12:16 PM, Andy Colson wrote:
On 4/11/2014 7:50 AM, Rémi Cura wrote:
Hey dear List,

I'm looking for some advice about the best way to perform a "fuzzy"
join, that is joining two table based on approximate matching.

It is about temporal matching
given a table A with rows containing data and a control_time (for
instance 1 ; 5; 6;  .. sec, not necessarly rounded of evenly-spaced)

given another table B with lines on no precise timing (eg control_time =
2.3 ; 5.8 ; 6.2 for example)

How to join every row of B to A based on
min(@(A.control_time-B.control_time))
(that is, for every row of B, get the row of A that is temporaly the
closest),
in an efficient way?
(to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)

Optionnaly, how to get interpolation efficiently (meaning one has to get
the previous time and next time for 1 st order interpolation, 2 before
and 2 after for 2nd order interpolation, and so on)?
(to be explicit 5.8 would match to 5 and 6, the weight being 0.2 and 0.8
respectively)

Currently my data is spatial so I use Postgis function to interpolate a
point on a line, but is is far from efficient or general, and I don't
have control on interpolation (only the spatial values are interpolated).

Cheers,
Rémi-C

Ok, here is a just sql way.  No ranges.  No idea if its right.  A first
pass, so to speak.

create table a(t float, data text);
create table b(t float, data text);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

select a.t, b.t
from (
   select t, least( least(t, mint), least(t, maxt)) as t2 from (
     select t,
      (select t from a where a.t >= b.t order by a.t limit 1) as mint,
      (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
   from b
   ) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)

The middle part is the magic:

select t,
  (select t from a where a.t >= b.t order by a.t limit 1) as mint,
  (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

The rest is just to make it usable.  If t is indexed, it'll probably be
fast too.

-Andy

Here is a guess with ranges:

select a.t, (select t from b where b.t <@ numrange(a.t-2, a.t+2, '[]') 
order by abs(a.t-b.t) limit 1)
from a

It returns:
         t           t
----------  ----------
         1         2.3
         5         5.8
         6         5.8

which is different than the previous sql, but its not wrong.  6 is the 
same distance between 5.8 and 6.2, so both are the correct choice.

I had to change my tables (or type cast a lot):

create table a(t numeric);
create table b(t numeric);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general