Query optimization using order by and limit

Michael Viscuso <michael.viscuso@xxxxxxxxxxxxxxxxxx> · Wed, 21 Sep 2011 19:14:09 -0400

First of all, thank you for taking the time to review my question.  After attending the PostgresOpen conference in Chicago last week, I've been pouring over explain logs for hours on end and although my system is MUCH better, I still can't resolve a few issues.  Luckily my data is pretty well structured so solving one issue will likely solve many more so I'll start with this one.

Version: PostgreSQL 9.1rc1, compiled by Visual C++ build 1500, 64-bit
OS: Windows 7 64-bit
ORM: SQLAlchemy
Postgres table structure: I have daily partitioned tables for each of 4 "core tables" (the tables with the majority of my application's data).  Each daily table inherits from its parent.  I do not explicitly define a REFERENCE between these tables because I cannot guarantee the order in which the events are inserted into the database, but where there are references, the referenced row should exist in the other's daily table.  The reason I partitioned the data in this manner is to increase query speed and make it easy to archive old data. (I'm new to high-end Postgres performance so there's likely several fundamental flaws in my assumptions.  I won't turn down any recommendation.) 

An example of a daily partitioned table follows:

cb=# \d osmoduleloads_2011_09_14;
                                     Table "public.osmoduleloads_2011_09_14"
        Column         |            Type             |                         Modifiers
-----------------------+-----------------------------+------------------------------------------------------------
 guid                  | numeric(20,0)               | not null
 osprocess_guid        | numeric(20,0)               | not null
 filepath_guid         | numeric(20,0)               | not null
 firstloadtime         | numeric(20,0)               | not null
 md5hash               | bytea                       | not null
 host_guid             | numeric(20,0)               | default NULL::numeric
 process_create_time   | numeric(20,0)               | default NULL::numeric
 process_filepath_guid | numeric(20,0)               | default NULL::numeric
 event_time            | timestamp without time zone | default '2011-09-14 00:00:00'::timestamp without time zone
Indexes:
    "osmoduleloads_2011_09_14_pkey" PRIMARY KEY, btree (guid)
    "idx_osmoduleloads_2011_09_14_filepath_guid" btree (filepath_guid)
    "idx_osmoduleloads_2011_09_14_firstload_time" btree (firstloadtime)
    "idx_osmoduleloads_2011_09_14_host_guid" btree (host_guid)
    "idx_osmoduleloads_2011_09_14_md5hash" btree (md5hash)
    "idx_osmoduleloads_2011_09_14_osprocess_guid" btree (osprocess_guid)
Check constraints:
    "osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)
    "osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 129604464000000000::bigint::numeric AND firstloadtime < 129605328000000000::bigint::numeric)
Inherits: osmoduleloads

Objective:  The firstloadtime check constraint ensures that the record is applicable to that daily table. (In case you were wondering, the large numerics correspond to the Windows 100-nanosecond since the Epoch.) I'm inserting millions of records into each daily table so "query slowness" is quite easy to spot.  Given that there is so much data per daily table, I was hoping to use the order by and limit clauses to "stop out" a query once it sufficed the limit clause and not be forced to visit each daily table.  However, I'm spending way too much time in the older tables than I'd like - which leads me to believe that I;m doing something wrong.  For ease of viewing, my explain analyze can be found at http://explain.depesz.com/s/tot 

I'm still very new to this so I'm not sure if explain.depesz.com saves the original query.  It wasn't readily apparent that it did so here is the original query:

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.* FROM (SELECT * FROM osmoduleloads JOIN hosts ON hosts.guid = osmoduleloads.host_guid WHERE hosts.guid = '2007075705813916178' AND osmoduleloads.firstloadtime >= 129604320000000000 AND osmoduleloads.firstloadtime < 129610367990000000 AND hosts.enabled = true AND hosts.user_id = 111 ORDER BY osmoduleloads.firstloadtime DESC LIMIT 251) AS osm_1 LEFT OUTER JOIN storefiles AS storefiles_1 ON osm_1.md5hash = storefiles_1.md5hash LEFT OUTER JOIN filepaths AS filepaths_1 ON osm_1.process_filepath_guid = filepaths_1.guid AND osm_1.event_time = filepaths_1.event_time LEFT OUTER JOIN filepaths AS filepaths_2 ON osm_1.filepath_guid = filepaths_2.guid AND osm_1.event_time= filepaths_2.event_time ORDER BY osm_1.firstloadtime DESC;

Hopefully my assumptions about order by and limit are correct and this query can be optimized.  

Again, appreciate any help you can lend.  Thanks in advance.

Mike