Hi Tom,

Thanks for your quick response.

2009/8/26 Tom Lane <tgl@xxxxxxxxxxxxx>
> <hashinclude@xxxxxxxxx> writes:
> > In my timing tests, the performance of PG is quite a lot worse than the
> > equivalent BerkeleyDB implementation.
>
> Are you actually comparing apples to apples?  I don't recall that BDB
> has any built-in aggregation functionality.  It looks to me like you've
> moved some work out of the client into the database.

I'm measuring end-to-end time, which includes the in-code aggregation with
BDB (after the DB fetch) and the in-query aggregation in PG.

> > 1. Is there anything I can do to speed up performance for the queries?
>
> Do the data columns have to be bigint, or would int be enough to hold
> the expected range?  SUM(bigint) is a *lot* slower than SUM(int),
> because the former has to use "numeric" arithmetic whereas the latter
> can sum in bigint.  If you want to keep the data on-disk as bigint,
> but you know the particular values being summed here are not that
> big, you could cast in the query (SUM(data_1::int) etc).

For the 300-second tables I can probably drop the columns to integer, but
the 3600- and 86400-second tables (1 hour, 1 day) will probably need to
stay BIGINT. However, given that I'm on a 64-bit platform (sorry if I
didn't mention that earlier), does it make that much of a difference? And
how does a float ("REAL") compare in terms of SUM()s?

> I'm also wondering if you've done something to force indexscans to be
> used.  If I'm interpreting things correctly, some of these scans are
> traversing all/most of a partition and would be better off as seqscans.

One thing I noticed is that if I specify which devices I want the data for
(specifically, all of them, listed out as DEVICE IN (1,2,3,4,5,...) in the
WHERE clause), PG uses a bitmap heap scan, while if I don't specify the
list (which still gives me data for all the devices), PG uses a sequential
scan. (I might have missed the DEVICE IN (...) in my earlier query.)
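For reference, the cast-based rewrite I understood from your SUM() suggestion would look something like this (table and column names here are placeholders, not my actual schema; and I realize the cast would raise "integer out of range" if any value exceeds int4):

```sql
-- Hypothetical 300-second partition; data_1 is stored on disk as BIGINT.
-- Casting to int before summing lets PG accumulate in bigint rather than
-- falling back to the slower "numeric" arithmetic.
SELECT device, SUM(data_1::int) AS total_1
FROM   stats_300            -- placeholder table name
WHERE  device IN (1, 2, 3, 4, 5)
GROUP  BY device;
```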
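And in case it helps, this is roughly how I've been comparing the two plans on the psql command line (again with placeholder names; the plan choice is just what I observe on my data, not something I'm forcing):

```sql
-- With the IN list spelled out, the planner picks a bitmap heap scan here.
EXPLAIN ANALYZE
SELECT device, SUM(data_1)
FROM   stats_300
WHERE  device IN (1, 2, 3, 4, 5)
GROUP  BY device;

-- Without it -- the same result set in my case, since the list covers all
-- devices -- the planner picks a sequential scan instead.
EXPLAIN ANALYZE
SELECT device, SUM(data_1)
FROM   stats_300
GROUP  BY device;
```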
However, more often than not, the query _will_ be of the form DEVICE IN
(...). If I actually execute the queries (on the psql command line), their
runtimes are about the same (15 s vs. 16 s).

> > shared_buffers = 128MB
>
> This is really quite lame for the size of machine and database you've
> got.  Consider knocking it up to 1GB or so.

OK, I've bumped it up to 1 GB. However, that doesn't seem to make a huge
difference (unless I need to do the same on libpqxx's connection object
too).

Cheers,
Hrishi

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance