(please note that this random string function is NOT the good way to do it,
i should random int then use it as index to an array containing all the letter)Thanks a lot for this new version!
It seems to be slower than your first solution (no index use I guess, I gave up after 5 minutes vs 5 sec for the previous). Morevover, I canno't make assumption about a fixed interval (2 sec in your example). But I think I see where you are going.After some test, the fastest is using BETWEEN and range.
(it is way faster than using the <@, strangely)
Here is the code :
-------------------------------------------------------
--usefull function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
RETURNS text AS $$
SELECT array_to_string(
ARRAY(
SELECT
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
FROM generate_series(1,$1)
)
,'')
$$ LANGUAGE sql;
--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
create table a(gid int, t numeric, r numrange, data text);
create table b(gid int, t numeric, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);
--CREATE INDEX ON a USING spgist (r);
CREATE INDEX ON a (r);
--filling tables
WITH the_serie AS (
SELECT s AS gid, s+random()/2-0.5 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into a (gid, t,r, data) SELECT gid, s, numrange((lag(s,1) over(order by gid ASC))::numeric ,s::numeric) , data
FROM the_serie;
WITH the_serie AS (
SELECT s as gid, s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into b (gid, t, data) SELECT gid,s, data
FROM the_serie;
ANALYZE a;
ANALYZE b;
--computing join with range
--slow : 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM b LEFT JOIN a ON (b.t <@ a.r)
ORDER BY gid ASC
LIMIT 30
--slow: 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM a,b
WHERE b.t <@a.r
--fast : 3sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.* , a.data as d2
FROM a,b
WHERE b.t BETWEEN lower(a.r) AND upper(a.r)
--fast : 8 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
select a.t As a_t, b.t as b_t
from (
select t, least( least(t, mint), least(t, maxt)) as t2 from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)
-------------------------------------------------------
-------------------------------------------------------
--usefull function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
RETURNS text AS $$
SELECT array_to_string(
ARRAY(
SELECT
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
FROM generate_series(1,$1)
)
,'')
$$ LANGUAGE sql;
--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
create table a(gid int, t numeric, r numrange, data text);
create table b(gid int, t numeric, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);
--CREATE INDEX ON a USING spgist (r);
CREATE INDEX ON a (r);
--filling tables
WITH the_serie AS (
SELECT s AS gid, s+random()/2-0.5 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into a (gid, t,r, data) SELECT gid, s, numrange((lag(s,1) over(order by gid ASC))::numeric ,s::numeric) , data
FROM the_serie;
WITH the_serie AS (
SELECT s as gid, s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into b (gid, t, data) SELECT gid,s, data
FROM the_serie;
ANALYZE a;
ANALYZE b;
--computing join with range
--slow : 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM b LEFT JOIN a ON (b.t <@ a.r)
ORDER BY gid ASC
LIMIT 30
--slow: 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM a,b
WHERE b.t <@a.r
--fast : 3sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.* , a.data as d2
FROM a,b
WHERE b.t BETWEEN lower(a.r) AND upper(a.r)
--fast : 8 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
select a.t As a_t, b.t as b_t
from (
select t, least( least(t, mint), least(t, maxt)) as t2 from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)
-------------------------------------------------------
Rémi-C
2014-04-11 20:18 GMT+02:00 Rémi Cura <remi.cura@xxxxxxxxx>:
And it's okay if 2 row from B share the same join to row from A, because when interpolating it will be different.(theoretically, for each row of B , compute the distance to every other row of A!)but I had discarded it for fear of suchbad complexityWow many thanks!I had thought about the order by and limit because it is the natural way to express the problem,
.
Here is the test env with realistic number, your solution is very fast, I have to raise my hat (4 sec!)-------------------------------------------------------
--usefull function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
RETURNS text AS $$
SELECT array_to_string(
ARRAY(
SELECT
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
FROM generate_series(1,$1)
)
,'')
$$ LANGUAGE sql;
--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;CREATE INDEX ON a (t);
create table a(t float, data text);
create table b(t float, data text);
CREATE INDEX ON b (t);
--filling tables
WITH the_serie AS (
SELECT s+random()/2 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into a SELECT s, data
FROM the_serie;
WITH the_serie AS (
SELECT s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into b SELECT s, data
FROM the_serie;
ANALYZE a;
ANALYZE b;
--computing result
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
select a.t As a_t, b.t as b_t--------------------------------------------------------
from (
select t, least( least(t, mint), least(t, maxt)) as t2 from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)
2014-04-11 19:16 GMT+02:00 Andy Colson <andy@xxxxxxxxxxxxxxx>:
On 4/11/2014 7:50 AM, Rémi Cura wrote:
Ok, here is a just sql way. No ranges. No idea if its right. A first pass, so to speak.Hey dear List,
I'm looking for some advice about the best way to perform a "fuzzy"
join, that is joining two table based on approximate matching.
It is about temporal matching
given a table A with rows containing data and a control_time (for
instance 1 ; 5; 6; .. sec, not necessarly rounded of evenly-spaced)
given another table B with lines on no precise timing (eg control_time =
2.3 ; 5.8 ; 6.2 for example)
How to join every row of B to A based on
min(@(A.control_time-B.control_time))
(that is, for every row of B, get the row of A that is temporaly the
closest),
in an efficient way?
(to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)
Optionnaly, how to get interpolation efficiently (meaning one has to get
the previous time and next time for 1 st order interpolation, 2 before
and 2 after for 2nd order interpolation, and so on)?
(to be explicit 5.8 would match to 5 and 6, the weight being 0.2 and 0.8
respectively)
Currently my data is spatial so I use Postgis function to interpolate a
point on a line, but is is far from efficient or general, and I don't
have control on interpolation (only the spatial values are interpolated).
Cheers,
Rémi-C
create table a(t float, data text);
create table b(t float, data text);
insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);
select a.t, b.t
from (
select t, least( least(t, mint), least(t, maxt)) as t2 from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)
The middle part is the magic:
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
The rest is just to make it usable. If t is indexed, it'll probably be fast too.
-Andy