Re: efficient way to do "fuzzy" join

Rémi Cura <remi.cura@xxxxxxxxx> · Mon, 14 Apr 2014 11:58:08 +0200

2014-04-12 15:04 GMT+02:00 Andy Colson <andy@xxxxxxxxxxxxxxx>:

On 04/12/2014 06:29 AM, Rémi Cura wrote:

(please note that this random string function is NOT the good way to

do it, i should random int then use it as index to an array

containing all the letter)

Thanks a lot for this new version! It seems to be slower than your

first solution (no index use I guess, I gave up after 5 minutes vs 5

sec for the previous). Morevover, I canno't make assumption about a

fixed interval (2 sec in your example). But I think I see where you

are going.

After some test, the fastest is using BETWEEN and range. (it is way

faster than using the <@, strangely)

Here is the code :

Ah, sorry about that.  I got pulled away to work on work stuff.  I was trying to figure out how to use an index on the range query, but not sure, without adding a new column if it would even work.

I've never had the need for ranges yet, this is the first time I've gotten to play with them.

I would not have thought about between like that, good call.  I'd have never guess it would be so fast.

If you can't use the fixed interval, then ranges are out.

I was thinking this could be improved:

select t,

 (select t from a where a.t >= b.t order by a.t limit 1) as mint,

 (select t from a where a.t < b.t order by a.t desc limit 1) as maxt

from b

It does two selects into a to find the nearest.  Given this:

create table a(t float);

insert into a values (1), (5), (6);

could you write a single query to find the number nearest 3.5?  If so we might cut the work by 50%.

-Andy

PS: This list prefers you don't top post.

Hey,
the best I can come up with using your original idea is :
------------------------------------------------------------------

--fast-ish: 10sec
DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
    SELECT lower_b_a.gid AS gid_b, lower_b_a.t AS t_b --, lower_b_a.data AS data_b
        , lower_b_a.gid_l_b AS gid_a_lower , a1.t AS t_a_lower--, a1.data AS data_a_lower 

        , lower_b_a.gid_l_b -1 AS gid_a_upper , a2.t AS t_a_upper--, a2.data AS data_a_upper
    FROM     (
            SELECT b.gid, b.t 
                , (SELECT  gid  FROM a WHERE a.t>=b.t order by a.t ASC LIMIT 1  ) AS gid_l_b

            FROM b) as lower_b_a
        LEFT OUTTER JOIN a AS a1 ON (a1.gid = gid_l_b) LEFT OUTTER JOIN a AS a2 ON  (a2.gid = gid_l_b-1)
-------------------------------------------------------------------

As you suggested it doesn't read the table twice, but only once (to find the closest lower value). The closest upper value is found by knowing it is in the next row taht the closest lower value.

Yet it is still slower :-/

The way to go seems to be the numrange.

Thanks a lot for the help in this optimization !

Cheers,

Rémi-C