So, I have an query that has a very great difference between the possible
performance and the achieved performance. I was wondering if there was a
possibility that Postgres would approach the possible performance by some
devious method.
The query returns a list of overlaps between objects. Each object is
defined by a start position and end position and a foreign key to the
thing that is is located on. It's genes on chromosomes, in case you were
wondering. The relevant parts of the location table are as follows:
release-16.0-preview-14-mar=# \d location
Table "public.location"
Column | Type | Modifiers
-----------------+---------+-----------
end | integer |
start | integer |
objectid | integer |
id | integer | not null
Indexes:
"location__object" btree (objectid, id)
"location__start" btree (start)
"location_bioseg" gist (bioseg_create(intermine_start, intermine_end))
The table has 3490079 entries with 21875 distinct objectids, although the
majority of entries are concentrated on about ten distinct objectids. The
entire table is in cache.
The query to run is:
SELECT
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE
l1.objectid = l2.objectid
AND l1.id <> l2.id
AND l1.start < l2.end
AND l2.start < l1.end
The EXPLAIN result:
QUERY PLAN
----------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..180896163.93 rows=54169773639 width=8)
Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end) AND (l2.start < l1.end))
-> Seq Scan on location l1 (cost=0.00..78076.79 rows=3490079 width=16)
-> Index Scan using location__object on location l2 (cost=0.00..24.43 rows=1369 width=16)
Index Cond: (l2.objectid = l1.objectid)
(5 rows)
I could get an EXPLAIN ANALYSE, but it would take quite a few hours.
Now, we have a spacial gist index on (start, end), using the bioseg addon,
which works great for single overlap lookups, and does speed up the above
query, if we alter it to have an equivalent meaning, but use the bioseg
function:
SELECT
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE
l1.objectid = l2.objectid
AND l1.id <> l2.id
AND bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end);
The EXPLAIN result:
QUERY PLAN
--------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..99982692.10 rows=4875280 width=8)
Join Filter: ((l1.id <> l2.id) AND (l1.objectid = l2.objectid))
-> Seq Scan on location l1 (cost=0.00..78076.79 rows=3490079 width=16)
-> Index Scan using location_bioseg on location l2 (cost=0.00..12.92 rows=698 width=16)
Index Cond: (bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end))
(5 rows)
This query takes about two hours.
Now, it happens that there is an algorithm for calculating overlaps which
is really quick. It involves iterating through the table in order of the
start variable and keeping a list of ranges which "haven't ended yet".
When you read the next range from the table, you firstly purge all the
ranges from the list that end before the beginning of the new range. Then,
you output a result row for each element in the list combined with the new
range, then you add the new range to the list.
This algorithm just doesn't seem to fit into SQL at all. However, it would
reduce two hours down to a few seconds. Any chances of it running in
Postgres, or any other tips?
Matthew
--
Hi! You have reached 555-0129. None of us are here to answer the phone and
the cat doesn't have opposing thumbs, so his messages are illegible. Please
leave your name and message after the beep ...
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance