Re: efficient way to do "fuzzy" join

Andy Colson <andy@xxxxxxxxxxxxxxx> · Fri, 11 Apr 2014 11:10:52 -0500

2014-04-11 17:09 GMT+02:00 Andy Colson <andy@xxxxxxxxxxxxxxx
<mailto:andy@xxxxxxxxxxxxxxx>>:

    On 4/11/2014 7:50 AM, Rémi Cura wrote:

        Hey dear List,

        I'm looking for some advice about the best way to perform a "fuzzy"
        join, that is joining two table based on approximate matching.

        It is about temporal matching
        given a table A with rows containing data and a control_time (for
        instance 1 ; 5; 6;  .. sec, not necessarly rounded of evenly-spaced)

        given another table B with lines on no precise timing (eg
        control_time =
        2.3 ; 5.8 ; 6.2 for example)

        How to join every row of B to A based on
        min(@(A.control_time-B.__control_time))
        (that is, for every row of B, get the row of A that is temporaly the
        closest),
        in an efficient way?
        (to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)

        Optionnaly, how to get interpolation efficiently (meaning one
        has to get
        the previous time and next time for 1 st order interpolation, 2
        before
        and 2 after for 2nd order interpolation, and so on)?
        (to be explicit 5.8 would match to 5 and 6, the weight being 0.2
        and 0.8
        respectively)

        Currently my data is spatial so I use Postgis function to
        interpolate a
        point on a line, but is is far from efficient or general, and I
        don't
        have control on interpolation (only the spatial values are
        interpolated).

        Cheers,
        Rémi-C

    Have you seen the range type?

    http://www.postgresql.org/__docs/9.3/static/rangetypes.__html
    <http://www.postgresql.org/docs/9.3/static/rangetypes.html>

    Not fuzzy, but is indexable.

    -Andy

On 4/11/2014 10:57 AM, Rémi Cura wrote:> Hey,
> thanks for your answer.
>
> I think you are right, range type with index could at least provide a
> fast matching,
> thus avoiding the numrow(A) * numrow(B) complexity .
>
> Though I don't see how to use it to interpolate for more than 1st order.
>
> Cheers,
> Rémi-C
>
>

Hum..  Would you like to set an upper bound on the number of seconds the 
join would match?  Maybe range types could give you an indexed upper 
bound ("match within +/- 2 seconds only"), then use another match to 
find the actual min.  (I do something like this in PostGis, use bounding 
box to do quick index lookup, then st_distance to find the nearest)

I can see two row's in A matching the same row in B.  Would that be ok?

TableA
------
1
5
6

TableB
------
0.9
1.1
6.6
7.7

How should those two tables join?

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general