Very specialised query

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




So, I have an query that has a very great difference between the possible performance and the achieved performance. I was wondering if there was a possibility that Postgres would approach the possible performance by some devious method.

The query returns a list of overlaps between objects. Each object is defined by a start position and end position and a foreign key to the thing that is is located on. It's genes on chromosomes, in case you were wondering. The relevant parts of the location table are as follows:

release-16.0-preview-14-mar=# \d location
        Table "public.location"
     Column      |  Type   | Modifiers
-----------------+---------+-----------
 end             | integer |
 start           | integer |
 objectid        | integer |
 id              | integer | not null
Indexes:
    "location__object" btree (objectid, id)
    "location__start" btree (start)
    "location_bioseg" gist (bioseg_create(intermine_start, intermine_end))

The table has 3490079 entries with 21875 distinct objectids, although the majority of entries are concentrated on about ten distinct objectids. The entire table is in cache.

The query to run is:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = l2.objectid
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l2.start < l1.end

The EXPLAIN result:
                                        QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..180896163.93 rows=54169773639 width=8)
   Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end) AND (l2.start < l1.end))
   ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
   ->  Index Scan using location__object on location l2  (cost=0.00..24.43 rows=1369 width=16)
         Index Cond: (l2.objectid = l1.objectid)
(5 rows)

I could get an EXPLAIN ANALYSE, but it would take quite a few hours.

Now, we have a spacial gist index on (start, end), using the bioseg addon, which works great for single overlap lookups, and does speed up the above query, if we alter it to have an equivalent meaning, but use the bioseg function:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = l2.objectid
    AND l1.id <> l2.id
    AND bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end);

The EXPLAIN result:
                                       QUERY PLAN
--------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..99982692.10 rows=4875280 width=8)
   Join Filter: ((l1.id <> l2.id) AND (l1.objectid = l2.objectid))
   ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
   ->  Index Scan using location_bioseg on location l2  (cost=0.00..12.92 rows=698 width=16)
         Index Cond: (bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end))
(5 rows)

This query takes about two hours.

Now, it happens that there is an algorithm for calculating overlaps which is really quick. It involves iterating through the table in order of the start variable and keeping a list of ranges which "haven't ended yet". When you read the next range from the table, you firstly purge all the ranges from the list that end before the beginning of the new range. Then, you output a result row for each element in the list combined with the new range, then you add the new range to the list.

This algorithm just doesn't seem to fit into SQL at all. However, it would reduce two hours down to a few seconds. Any chances of it running in Postgres, or any other tips?

Matthew

--
Hi! You have reached 555-0129. None of us are here to answer the phone and the cat doesn't have opposing thumbs, so his messages are illegible. Please leave your name and message after the beep ...

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux