Sorted union

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Using PostgreSQL 8.0.4.

I've got a table with 4.5 million rows that I expect to become huge (hundred million? who knows). Each row has a start and end time. I want to retrieve all the events during a timespan in one list; typical timespans will involve up to a couple rows. If the start and stop happen at the same time (which is possible), the start must come first in my sequence. So essentially, I want this:

    select    when_stopped as when_happened,
              1 as order_hint
    from      transaction t
    where     '2005-10-25 15:00:00' <= when_stopped
      and     when_stopped <= '2005-10-26 10:00:00'
    union all
    select    when_stopped as when_happened,
              2 as order_hint
    from      transaction t
    where     '2005-10-25 15:00:00' <= when_stopped
      and     when_stopped <= '2005-10-26 10:00:00'
    order by  when_happened, order_hint;

I'd really like the first row to be retrieved in O(1) time and the last in O(n) time (n = number of rows in the timespan, not the whole table). I previously was doing things manually with flat files. But there's a sort in PostgreSQL's plan, so I think I'm getting O(n log n) time for both. It's frustrating to start using a real database and get performance regressions.

QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ --------------------------------- Sort (cost=667469.90..676207.19 rows=3494916 width=8) (actual time=28503.612..31377.254 rows=3364006 loops=1)
   Sort Key: when_happened, order_hint
-> Append (cost=0.00..194524.95 rows=3494916 width=8) (actual time=0.191..14659.712 rows=3364006 loops=1) -> Subquery Scan "*SELECT* 1" (cost=0.00..97262.48 rows=1747458 width=8) (actual time=0.190..5375.925 rows=1682003 loops=1) -> Index Scan using transaction_stopped on "transaction" t (cost=0.00..79787.90 rows=1747458 width=8) (actual time=0.186..2962.585 rows=1682003 loops=1) Index Cond: (('2005-10-25 15:00:00'::timestamp without time zone <= when_stopped) AND (when_stopped <= '2005-10-26 10:00:00'::timestamp without time zone)) -> Subquery Scan "*SELECT* 2" (cost=0.00..97262.48 rows=1747458 width=8) (actual time=0.167..5449.151 rows=1682003 loops=1) -> Index Scan using transaction_stopped on "transaction" t (cost=0.00..79787.90 rows=1747458 width=8) (actual time=0.163..3026.730 rows=1682003 loops=1) Index Cond: (('2005-10-25 15:00:00'::timestamp without time zone <= when_stopped) AND (when_stopped <= '2005-10-26 10:00:00'::timestamp without time zone))
Total runtime: 33312.814 ms
(10 rows)

Each side of the union is retrieved in sorted order, but it doesn't seem to realize that. There seem to be two things it's missing:

(1) It doesn't recognize that constant expressions are irrelevant to the sort. I.e., the first half of the union:

    select    when_started as when_happened,
              1 as order_hint
    from      transaction t
    where     '2005-10-25 15:00:00' <= when_started
      and     when_started <= '2005-10-26 10:00:00'
    order by  when_happened, order_hint;

does this:

QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ --------------------- Sort (cost=291770.42..296139.05 rows=1747453 width=8) (actual time=8462.026..9895.715 rows=1681994 loops=1)
   Sort Key: when_started, 1
-> Index Scan using transaction_started on "transaction" t (cost=0.00..79788.21 rows=1747453 width=8) (actual time=0.190..2953.393 rows=1681994 loops=1) Index Cond: (('2005-10-25 15:00:00'::timestamp without time zone <= when_started) AND (when_started <= '2005-10-26 10:00:00'::timestamp without time zone))
Total runtime: 10835.114 ms
(5 rows)

The sort is unnecessary. If I take out the constant order_hint, it works:

    select    when_started as when_happened
    from      transaction t
    where     '2005-10-25 15:00:00' <= when_started
      and     when_started <= '2005-10-26 10:00:00'
    order by  when_happened;

QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ --------------- Index Scan using transaction_started on "transaction" t (cost=0.00..79788.21 rows=1747453 width=8) (actual time=0.189..2715.513 rows=1681994 loops=1) Index Cond: (('2005-10-25 15:00:00'::timestamp without time zone <= when_started) AND (when_started <= '2005-10-26 10:00:00'::timestamp without time zone))
Total runtime: 3630.817 ms
(3 rows)

(2) It doesn't recognize that each half of the union is sorted and thus they only need to be merged. This is true even without the order_hint bits:

    select    when_stopped as when_happened
    from      transaction t
    where     '2005-10-25 15:00:00' <= when_stopped
      and     when_stopped <= '2005-10-26 10:00:00'
    union all
    select    when_stopped as when_happened
    from      transaction t
    where     '2005-10-25 15:00:00' <= when_stopped
      and     when_stopped <= '2005-10-26 10:00:00'
    order by  when_happened;

QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ --------------------------------- Sort (cost=667469.90..676207.19 rows=3494916 width=8) (actual time=28088.783..30898.854 rows=3364006 loops=1)
   Sort Key: when_happened
-> Append (cost=0.00..194524.95 rows=3494916 width=8) (actual time=0.153..14410.485 rows=3364006 loops=1) -> Subquery Scan "*SELECT* 1" (cost=0.00..97262.48 rows=1747458 width=8) (actual time=0.152..5287.092 rows=1682003 loops=1) -> Index Scan using transaction_stopped on "transaction" t (cost=0.00..79787.90 rows=1747458 width=8) (actual time=0.149..2885.905 rows=1682003 loops=1) Index Cond: (('2005-10-25 15:00:00'::timestamp without time zone <= when_stopped) AND (when_stopped <= '2005-10-26 10:00:00'::timestamp without time zone)) -> Subquery Scan "*SELECT* 2" (cost=0.00..97262.48 rows=1747458 width=8) (actual time=0.152..5254.425 rows=1682003 loops=1) -> Index Scan using transaction_stopped on "transaction" t (cost=0.00..79787.90 rows=1747458 width=8) (actual time=0.149..2905.861 rows=1682003 loops=1) Index Cond: (('2005-10-25 15:00:00'::timestamp without time zone <= when_stopped) AND (when_stopped <= '2005-10-26 10:00:00'::timestamp without time zone))
Total runtime: 32766.566 ms
(10 rows)

Is there some way I can work around this? The best I can think of now is to open two connections, one for each half of the union. I can do the merge manually on the client side. It'd work, but I'd really prefer the database server take care of this for me.

--
Scott Lamb <http://www.slamb.org/>



[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux