On 1/17/07, Jeremy Haile <jhaile@xxxxxxxxxxx> wrote:
Maybe - I tried running the same query for an older time range that is
less likely to be cached. The index scan took longer than my previous
example, but still only took 16 seconds, compared to the 87 seconds
required to seqscan the table. When I can, I'll restart the machine and
run a comparison again to get a "pure" test.
Heh, querying a different range of data was a better idea than rebooting... doh. So I think you've reasonably established that an index scan for unbuffered data would still be faster than a sequential scan.
To me this is one of those cases where the optimizer doesn't understand the clustering of the data, and it is being misled by the statistics and fixed parameters it has. If you have fast disks (I think a fiber SAN probably counts here), then adjusting random_page_cost lower is reasonable; the lowest I have heard recommended is 2.0. It would be nice if the database could learn to estimate these values, as newer versions of Oracle do.
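For example (the table name below is just a placeholder, and I'm guessing at the date range), you can try it per-session before touching postgresql.conf:

-- lower the cost estimate for this session only and re-test
SET random_page_cost = 2.0;

EXPLAIN ANALYZE
SELECT count(*)
FROM transactions                        -- placeholder table name
WHERE transaction_date >= '2007-01-10'
  AND transaction_date <  '2007-01-11';

-- undo for the session, or set random_page_cost = 2.0 in postgresql.conf
-- (followed by a reload) once you're happy with the plans
RESET random_page_cost;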
Date is almost always a criterion in scans of this table. As mentioned
earlier, the table is naturally built in date order - so would
rebuilding the table help? Is it possible that even though I'm
inserting in date order, since I delete rows so often the physical
correlation would get disrupted as disk pages are reused? Perhaps
clustering on the transaction_date index and periodically running
"cluster" would help? Does vacuum full help with this at all?
Yes, cluster would rebuild the table for you. I wouldn't do anything too intrusive: run with random_page_cost lowered, perhaps vacuum full and reindex, and see what happens. If it degrades over time, then I would start looking at partitioning or some other solution.
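If you want to try the physical reorder, it's roughly this (made-up table/index names, so substitute your own; note these take exclusive locks while they run):

-- reclaim dead space and rebuild the indexes
VACUUM FULL transactions;
REINDEX TABLE transactions;

-- or rewrite the table in transaction_date order (8.2 syntax)
CLUSTER transactions_date_idx ON transactions;

-- either way, refresh the statistics afterwards
ANALYZE transactions;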
Yeah - partitioning makes a lot of sense and I've thought about doing
this in the past. Although I do run queries that cross multiple days,
most of my queries only look at today's data, so the physical disk
organization would likely be much better with a partitioned table setup.
Also, since I usually delete old data one day at a time, I could simply
drop the old day's partition. This would make vacuuming much less of an
issue.
Yep, my thoughts exactly. Partitioning support in PostgreSQL is there, but it needs tighter integration into the core (I shouldn't have to create a view, n tables, n rules, etc.). Additionally, I have read that past some number of partitions the performance degrades; I haven't really looked into it myself.
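For reference, the 8.2-style setup looks roughly like this (all names made up, and the housekeeping for creating each new day's partition is left out), which is why I'd like it better integrated:

-- parent table; children inherit its columns
CREATE TABLE transactions (
    id               bigint,
    transaction_date timestamp,
    amount           numeric
);

-- one child per day, with a CHECK constraint so constraint_exclusion
-- can skip partitions that can't match the query's date range
CREATE TABLE transactions_20070117 (
    CHECK (transaction_date >= '2007-01-17'
       AND transaction_date <  '2007-01-18')
) INHERITS (transactions);

CREATE INDEX transactions_20070117_date_idx
    ON transactions_20070117 (transaction_date);

-- a rule (or trigger) per partition to redirect inserts on the parent
CREATE RULE transactions_insert_20070117 AS
    ON INSERT TO transactions
    WHERE (transaction_date >= '2007-01-17'
       AND transaction_date <  '2007-01-18')
    DO INSTEAD
    INSERT INTO transactions_20070117 VALUES (NEW.*);

-- the payoff: dropping a whole day is instant, no vacuum needed
DROP TABLE transactions_20070110;

-- and in postgresql.conf:
--   constraint_exclusion = on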
--
Chad
http://www.postgresqlforums.com/