Re: efficient way to do "fuzzy" join

Rémi Cura <remi.cura@xxxxxxxxx> · Fri, 11 Apr 2014 17:57:53 +0200

Hey,
thanks for your answer.

I think you are right, range type with index could at least provide a fast matching,
thus avoiding the numrow(A) * numrow(B) complexity .

Though I don't see how to use it to interpolate for more than 1st order.

Cheers,
Rémi-C

2014-04-11 17:09 GMT+02:00 Andy Colson <andy@xxxxxxxxxxxxxxx>:

On 4/11/2014 7:50 AM, Rémi Cura wrote:

Hey dear List,

I'm looking for some advice about the best way to perform a "fuzzy"

join, that is joining two table based on approximate matching.

It is about temporal matching

given a table A with rows containing data and a control_time (for

instance 1 ; 5; 6;  .. sec, not necessarly rounded of evenly-spaced)

given another table B with lines on no precise timing (eg control_time =

2.3 ; 5.8 ; 6.2 for example)

How to join every row of B to A based on

min(@(A.control_time-B.control_time))

(that is, for every row of B, get the row of A that is temporaly the

closest),

in an efficient way?

(to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)

Optionnaly, how to get interpolation efficiently (meaning one has to get

the previous time and next time for 1 st order interpolation, 2 before

and 2 after for 2nd order interpolation, and so on)?

(to be explicit 5.8 would match to 5 and 6, the weight being 0.2 and 0.8

respectively)

Currently my data is spatial so I use Postgis function to interpolate a

point on a line, but is is far from efficient or general, and I don't

have control on interpolation (only the spatial values are interpolated).

Cheers,

Rémi-C

Have you seen the range type?

http://www.postgresql.org/docs/9.3/static/rangetypes.html

Not fuzzy, but is indexable.

-Andy