On 28.6.2011 23:28, Craig McIlwee wrote:
> Daily table explain analyze: http://explain.depesz.com/s/iLY
> Half month table explain analyze: http://explain.depesz.com/s/Unt

Are you sure those two queries are exactly the same? The daily case
output says the width is 50B, while the half-month case says it's 75B.
This might be why the sort/aggregate steps are switched, and the extra
width increases the amount of data so that it has to be sorted on disk
(which is why the half-month case is so much slower). Haven't you added
some columns to the half-month query?

> Postgres version:
> PostgreSQL 8.4.8, compiled by Visual C++ build 1400, 32-bit
>
> History:
> None, this is a new database and application
>
> Hardware:
> - 2 Intel Xeon 2.13GHz processors with 8 cores each
> - 8GB RAM
> - Disks are in a RAID 5 configuration with a PERC H700 Integrated RAID
>   Controller 512MB Cache
> - 5 disks, Seagate 7200 RPM SAS, 500GB each for a total capacity of
>   about 2TB
> - Windows Server 2008 R2 64-bit (but 32-bit postgres)
> - Hardware upgrades aren't an option at this point due to budget and
>   time constraints

Not much experience with PostgreSQL on Windows, but this looks good to
me. I'm not sure RAID 5 is a good choice, though, especially because of
its write performance - this is probably one of the reasons why the disk
sort is so slow (in the half-month case).

And it's nice that you have 8 cores, but don't forget that each query
executes in a single backend process, i.e. it may use only a single
core. So the query above can't use all 8 cores - that's why the
in-memory sort takes so long, I guess.

> Maintenance Setup:
> Autovacuum is disabled for these tables since the data is never
> updated. The tables that we are testing with at the moment will not
> grow any larger and have been both clustered and analyzed. They were
> clustered on the vds_detector_data_timestamp_idx index.
> GUC Settings:
> effective_cache_size: 2048MB
> work_mem: 512MB
> shared_buffers: 64MB, 512MB, and 1024MB, each yielded the same query
> plan and took the same amount of time to execute, give or take a few
> seconds
>
> Summary:
>
> The time to get the raw data (before aggregation and sorting) is
> relatively similar between the daily and half month tables. It would
> appear that the major difference is the ordering of sort and
> aggregation; the daily tables aggregate first, so the amount of data
> sorted is significantly less.

Yes, the ordering is the problem. The amount of data to sort is so huge
(3GB) that it does not fit into work_mem and has to be sorted on disk.
I'm not sure why this happens - the only difference I've noticed is the
'width' (50B vs. 75B). Are those two queries exactly the same?

> Since the daily tables are only 360MB, I would hope that the entire
> table could be pulled into memory with one large sequential read. Of
> course this assumes that the file pieces are stored contiguously, but
> auto defrag is enabled and shows low fragmentation, so I'm trusting
> (as much as one can) Windows to do the right thing here. My drives
> have a 150MB/s sustained max throughput, and considering that data is
> spread across 5 drives I would hope to at least be able to reach the
> single disk theoretical limit and read an entire table plus the index
> into memory in about 4 to 5 seconds. Based on the analyze output, each
> daily table averages 6 to 7 seconds, so I'm pretty close there and
> maybe just limited by disk speed?

Well, you have 30 partitions, and 7 seconds for each means 210 seconds
in total - which is about the time you get (before the aggregate/sort).

You have to check where the bottleneck is - is it the I/O or the CPU?
I'd guess the CPU, but I may be wrong. On unix I'd use something like
iostat/vmstat/top to see what's going on - I'm not sure what to use on
Windows. I guess there is some console, or maybe Process Explorer from
sysinternals.
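A quick way to see whether a sort spills to disk is the "Sort Method"
line in the EXPLAIN ANALYZE output. A hypothetical session follows -
the table and column names are assumptions based on the discussion, and
work_mem can be changed per session without touching the config file:

```sql
-- Hypothetical session; table/column names are assumptions.
SET work_mem = '512MB';   -- applies to this session only

EXPLAIN ANALYZE
SELECT road_name, date, hour, min, sum(volume)
  FROM vds_detector_data
 GROUP BY road_name, date, hour, min
 ORDER BY road_name, date, hour, min;

-- In the resulting plan, check the Sort node, e.g.:
--   Sort Method:  external merge  Disk: ...kB   <- spilled to disk
--   Sort Method:  quicksort  Memory: ...kB      <- fit into work_mem
```

If the reported disk usage is close to the 3GB mentioned above, no
realistic work_mem setting will keep it in memory, and reducing the
sorted data (aggregate first) is the only way out.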
> In both cases, the row estimates vs actual are way off. I've increased
> statistics on the reading_timestamp and road_name columns to 100 and
> then 1000 with no change. I ran an ANALYZE after each statistics
> change. Should I be upping stats on the non-indexed columns as well?
> I've read documentation that says I should be able to set statistics
> values for an entire table as opposed to per column, but haven't found
> how to do that. I guess I was either too lazy to update statistics on
> each column or just didn't think it would help much.

The estimates seem pretty good to me - a 10x difference is not that
much. They could be better, but I don't think you can get a better
plan; this one seems very reasonable to me.

> So, any pointers for performance improvement?

Three ideas that might help:

1) partial indexes

How much do the parameters in the query change? If there are parameters
that are always the same, you may try to create partial indexes. For
example, if the condition on 'volume' is always 'volume != 0', then you
can create an index like this:

  CREATE INDEX vds_detector_data_dir_idx
      ON vds_detector_data USING btree (road_dir)
   WHERE (volume != 0);

That way only the rows with 'volume != 0' will be included in the
index, the index will be smaller, and the bitmap will be built faster.
Similarly for the other conditions - the smaller the index, the faster
the bitmap creation. If all the conditions may change, or if the index
size does not shrink much, you can't do this.

2) prebuilt results

Another option is precomputing the 'per partition' results - if you
know what the conditions are going to be, you can precompute them and
then query just those (much smaller) tables. But this is very
application specific. Again, if all the conditions change, you can't
do this.

3) distribute the queries

As I've mentioned, PostgreSQL does not distribute a query across
multiple CPU cores, but you can do that on your own at the application
level.
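A sketch of that idea - the partition names and the exact column list
here are assumptions for illustration, not taken from the original
queries:

```sql
-- Run the same aggregate against each partition from a separate
-- connection (one per thread in the application), then append the
-- result sets. The partitions cover disjoint time ranges, so the
-- per-partition GROUP BY results do not overlap.

-- connection/thread 1:
SELECT date, hour, min, sum(volume) AS total_volume
  FROM vds_detector_data_2011_06_01      -- first daily partition
 WHERE volume != 0
 GROUP BY date, hour, min;

-- connection/thread 2 runs the same query against
-- vds_detector_data_2011_06_02, and so on; the application simply
-- concatenates the per-partition results.
```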
For example, I see the GROUP BY clause contains 'date, hour, min', so
you can compute the results for each partition separately (in a
different thread, using a separate connection) and then 'append' them.

Yes, you'll need to keep some metadata to do this efficiently (e.g. a
list of partitions along with their from/to timestamps), but you should
do this anyway, I guess (at least I do that when partitioning tables -
it makes the management much easier).

Not sure if you can do this with the other queries :-(

regards
Tomas

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance