Very specialised query

Matthew Wakeling <matthew@xxxxxxxxxxx> · Thu, 26 Mar 2009 14:30:55 +0000 (GMT)

So, I have an query that has a very great difference between the possible 
performance and the achieved performance. I was wondering if there was a 
possibility that Postgres would approach the possible performance by some 
devious method.

The query returns a list of overlaps between objects. Each object is 
defined by a start position and end position and a foreign key to the 
thing that is is located on. It's genes on chromosomes, in case you were 
wondering. The relevant parts of the location table are as follows:

release-16.0-preview-14-mar=# \d location
        Table "public.location"
     Column      |  Type   | Modifiers
-----------------+---------+-----------
 end             | integer |
 start           | integer |
 objectid        | integer |
 id              | integer | not null
Indexes:
    "location__object" btree (objectid, id)
    "location__start" btree (start)
    "location_bioseg" gist (bioseg_create(intermine_start, intermine_end))

The table has 3490079 entries with 21875 distinct objectids, although the 
majority of entries are concentrated on about ten distinct objectids. The 
entire table is in cache.

The query to run is:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = l2.objectid
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l2.start < l1.end

The EXPLAIN result:
                                        QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..180896163.93 rows=54169773639 width=8)
   Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end) AND (l2.start < l1.end))
   ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
   ->  Index Scan using location__object on location l2  (cost=0.00..24.43 rows=1369 width=16)
         Index Cond: (l2.objectid = l1.objectid)
(5 rows)

I could get an EXPLAIN ANALYSE, but it would take quite a few hours.

Now, we have a spacial gist index on (start, end), using the bioseg addon, 
which works great for single overlap lookups, and does speed up the above 
query, if we alter it to have an equivalent meaning, but use the bioseg 
function:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = l2.objectid
    AND l1.id <> l2.id
    AND bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end);

The EXPLAIN result:
                                       QUERY PLAN
--------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..99982692.10 rows=4875280 width=8)
   Join Filter: ((l1.id <> l2.id) AND (l1.objectid = l2.objectid))
   ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
   ->  Index Scan using location_bioseg on location l2  (cost=0.00..12.92 rows=698 width=16)
         Index Cond: (bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end))
(5 rows)

This query takes about two hours.

Now, it happens that there is an algorithm for calculating overlaps which 
is really quick. It involves iterating through the table in order of the 
start variable and keeping a list of ranges which "haven't ended yet". 
When you read the next range from the table, you firstly purge all the 
ranges from the list that end before the beginning of the new range. Then, 
you output a result row for each element in the list combined with the new 
range, then you add the new range to the list.

This algorithm just doesn't seem to fit into SQL at all. However, it would 
reduce two hours down to a few seconds. Any chances of it running in 
Postgres, or any other tips?

Matthew

--
Hi! You have reached 555-0129. None of us are here to answer the phone and 
the cat doesn't have opposing thumbs, so his messages are illegible. Please 
leave your name and message after the beep ...

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance