Re: Slow performance when querying millions of rows

On 06/28/2011 05:28 PM, Craig McIlwee wrote:
> Autovacuum is disabled for these tables since the data is never updated. The tables that we are testing with at the moment will not grow any larger and have been both clustered and analyzed.

Note that any such prep to keep from ever needing to maintain these tables in the future should include the FREEZE option, possibly with some parameters tweaked first to make it more aggressive. Autovacuum will eventually revisit them in order to prevent transaction ID wrap-around, even if it's disabled. If you're going to the trouble of prepping them so they are never touched again, you should do a freeze with the right parameters to keep this from happening again.
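To give a rough idea of the sort of one-time pass I mean (the table name here is a made-up placeholder), something along these lines freezes everything and refreshes statistics in the same pass, so the anti-wraparound vacuum that eventually shows up has little or nothing left to rewrite:

  -- One-time maintenance pass on a static daily table; "daily_20110628" is a placeholder name
  VACUUM FREEZE ANALYZE daily_20110628;

  -- Confirm nothing old is left behind; a small age() here means the freeze took
  SELECT relname, age(relfrozenxid)
    FROM pg_class
    WHERE relkind = 'r' AND relname = 'daily_20110628';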

> work_mem: 512MB
> shared_buffers: 64MB, 512MB, and 1024MB; each yielded the same query plan and took the same amount of time to execute, give or take a few seconds

shared_buffers doesn't normally impact the query plan; it impacts how much churn there is between the database and the operating system cache, which mainly matters for making write-heavy work efficient. On Windows you'll probably be safe to set this to 512MB and forget about it; Windows doesn't benefit from large shared_buffers values anyway.

This is a very large work_mem setting, however, so be careful that you won't have many users connecting at once if you're going to use it. Each connection can use a multiple of work_mem, making it quite possible to run out of memory with this configuration. If a low user count really is what you expect, you may want to enforce it by lowering max_connections, as a safety measure to prevent problems.
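As a sketch of what that combination might look like in postgresql.conf (the numbers are only illustrative, not a recommendation for your hardware):

  shared_buffers = 512MB    # larger values buy little on Windows
  work_mem = 512MB          # each sort/hash in every connection may grab this much
  max_connections = 20      # keep this low as long as work_mem stays this large

The point of the last line is simply to cap the worst case: a rough upper bound on sort/hash memory is max_connections times a small multiple of work_mem, so one of those two numbers has to stay small.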

> Since the daily tables are only 360MB, I would hope that the entire table could be pulled into memory with one large sequential read. Of course this assumes that the file pieces are stored contiguously, but auto defrag is enabled and shows low fragmentation, so I'm trusting (as much as one can) Windows to do the right thing here. My drives have a 150MB/s sustained max throughput, and considering that data is spread across 5 drives I would hope to at least be able to reach the single-disk theoretical limit and read an entire table plus the index into memory in about 4 to 5 seconds. Based on the analyze output, each daily table averages 6 to 7 seconds, so I'm pretty close there and maybe just limited by disk speed?

One thing to note is that drive speed varies based on which part of the disk the data is located on; the slower parts of the drive will deliver much less than 150MB/s.

On Linux servers it's impossible to get close to a disk's raw speed without making the operating system's read-ahead feature much more aggressive than it is by default. Because PostgreSQL fetches a single block at a time, keeping the drive completely busy requires something that notices the access pattern and reads data ahead of when the database even asks for it. You may find a parameter you can tune in the drive properties somewhere in the Windows Control Panel. There's also a read-ahead setting on your PERC card that you may not have turned on; it's not as good as the Linux one, but it's better than nothing. There are two useful settings there ("on" and "adaptive" if I recall correctly) that you can try, to see which works better.

> I've read documentation that says I should be able to set statistics values for an entire table as opposed to per column, but haven't found how to do that. I guess I was either too lazy to update statistics on each column or just didn't think it would help much.

You can adjust the statistics target across the entire database using the default_statistics_target setting, or you can tweak it per column using ALTER TABLE; there is no table-level control. I find it difficult to say whether the statistics are sufficient without actually looking at pg_stats to see how the database is interpreting the data and comparing that against the real distribution. This is an area where flailing about trying things doesn't work very well; you need to be very systematic about the analysis and testing strategy if you're going to get anywhere useful. It's not easy to do.
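If you do want to experiment, the per-column knob looks like this (table and column names are placeholders); raising the target and re-analyzing is what actually changes what the planner sees:

  -- Collect more detailed statistics for one column, then rebuild its stats
  ALTER TABLE daily_table ALTER COLUMN event_time SET STATISTICS 1000;
  ANALYZE daily_table;

  -- Compare what the planner now believes against the real data distribution
  SELECT attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'daily_table';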

As a larger commentary on what you're trying to do: applications like this often reach a point one day where you just can't allow arbitrary user queries to run against them anymore. What normally happens then is that the most common things people really need end up being run once and stored in some summary form, using techniques such as materialized views: http://wiki.postgresql.org/wiki/Materialized_Views

In your case, I would start now on trying to find the common patterns in the long-running reports people generate, and see if it's possible to pre-compute some portion of them and save that summary. You may find yourself in a continuous battle with business requests regardless. It's often key decision makers who feel they should be able to query any way they want, regardless of the impact on the database. Finding a middle ground there is usually challenging.
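A minimal sketch of the summary-table approach that wiki page describes, with completely made-up table and column names; the summary gets rebuilt on whatever schedule the reports can tolerate:

  -- Pre-computed rollup of one common report
  CREATE TABLE daily_summary AS
    SELECT date_trunc('day', event_time) AS day, detector_id, count(*) AS hits
      FROM daily_table
     GROUP BY 1, 2;
  CREATE INDEX daily_summary_day_idx ON daily_summary (day);

  -- Simplest refresh strategy: throw it away and rebuild, e.g. nightly
  TRUNCATE daily_summary;
  INSERT INTO daily_summary
    SELECT date_trunc('day', event_time), detector_id, count(*)
      FROM daily_table
     GROUP BY 1, 2;

Reports then hit daily_summary instead of scanning the raw tables, which is usually far less I/O.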

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
Comprehensive and Customized PostgreSQL Training Classes:
http://www.2ndquadrant.us/postgresql-training/



