On Thu, Dec 12, 2013 at 9:30 AM, Sev Zaslavsky <sevzas@xxxxxxxxx> wrote:
> Hello,
> I've got a very simple table with a very simple SELECT query, but it takes longer on the initial run than I'd like, so I want to see if there is a strategy to optimize this.
> Table rt_h_nbbo contains several hundred million rows. All rows for a given entry_date are appended to this table by an overnight process - on the order of several million rows per day.
> The objective is to select all of the rows for a given product_id on a given entry_date.
> There is a b-tree index on (product_id, entry_date). The index appears to be used correctly. I'm seeing that if the data pages are not in memory, nearly all of the time is spent on disk I/O. The first time, the query takes 21 sec. If I run this query a second time, it completes in approx 1-2 ms.
> I perceive an inefficiency here and I'd like your input on how to deal with it: the end result of the query is 1631 rows, which is on the order of a couple hundred Kb of data. Compare that to the amount of I/O that was done: 1634 buffers were loaded at 16Mb per page - that's about 24 Gb of data!
A page is usually 8KB, not 16MB (nor 16Mb).
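You can confirm the page size and see the buffer accounting directly; something like the following (the WHERE values are placeholders, not from your message):

```sql
-- Show the server's page size (typically 8192 bytes):
SHOW block_size;

-- Run the query with buffer accounting. "shared read" is the number of
-- pages that had to be fetched from outside shared_buffers:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM rt_h_nbbo
WHERE product_id = 12345            -- placeholder value
  AND entry_date = '2013-12-11';    -- placeholder value
```

At 8KB per buffer, 1634 buffers is roughly 13MB read to return a couple hundred Kb of rows - still far more I/O than the result size, which is consistent with each matching row sitting on its own page.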
> Query completed in 21 sec. I'd like to be able to physically re-organize the data on disk so that the data for a given product_id on an entry_date is concentrated on a few pages instead of being scattered like I see here.
If you load the data in daily batches, it is probably already fairly well clustered by entry_date. If you sort the batch by product_id before bulk loading it, then it should stay pretty well clustered on (entry_date, product_id).
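A sketch of what that nightly load could look like (the staging-table name and file path are hypothetical; PostgreSQL doesn't formally guarantee physical order from an ORDER BY on insert, but appended rows do in practice land in the order produced):

```sql
-- Hypothetical nightly load: stage the day's rows, then insert them
-- sorted by product_id so each product's rows land on adjacent pages.
CREATE TEMP TABLE rt_h_nbbo_stage (LIKE rt_h_nbbo);

COPY rt_h_nbbo_stage FROM '/path/to/daily_file.csv' WITH (FORMAT csv);

INSERT INTO rt_h_nbbo
SELECT * FROM rt_h_nbbo_stage
ORDER BY product_id;
```

For data that is already loaded, CLUSTER can rewrite the whole table in index order, but it takes an exclusive lock for the duration, which may not be acceptable on a table of this size.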
Cheers,
Jeff