On 24 Dec 2009, at 21:01, Doug El wrote:
> Hi,
>
> I have to summarize some data to be queried and there are about 5 million raw records a day I need to summarize. In a nutshell I don't think I'm laying it out in an optimal fashion, or not taking advantage of some postgres features perhaps, I'm looking for feedback.
>
> The raw incoming data is in the form of
>
> ip string uint uint uint uint
>
> So for any given record say:
>
> 8.8.8.8 helloworld 1 2 3 4
>
> First, I need to be able to query how many total and how many unique requests there were (unique by ip), over given time frame.
>
> So for the below data on the same day that's total two, but one unique
>
> 8.8.8.8 helloworld 1 2 3 4
> 8.8.8.8 helloworld 1 2 3 4
>
> Further for all fields (but ip which is not stored) I need to be able to query and get total/unique counts based off any combination of criteria.
>
> So if I refer to them as columns A-E
>
> A      B    C    D    E
> string uint uint uint uint

That's going to be a sequential scan no matter how you write it - even if you follow Scott's advice, although in that case the scan happens at a more convenient time and only once. To optimise this, first minimise the number of scans you need to get the desired results, and then minimise the amount of work the database has to perform per row.

I think a good start would be:

SELECT COUNT(*) AS total,
       COUNT(DISTINCT A || B::text || C::text || D::text || E::text)
  FROM table;

Be wary of NULL values, as those will make the result of the concatenation NULL as well. Coalesce() is your friend if that happens.

To further speed up the query you could pre-generate the concatenation of those columns, either when you insert the data or with a nightly cron-job (a rough sketch follows further down). In the latter case, make sure the job runs after all the data has arrived and before anyone queries that column, or there will be NULLs in it.

> I need to be able and say how many where col A = 'helloworld' and say col C = 4.
> Or perhaps col E = 4 and col c < 3 etc, any combination.

I don't see what the problem is here - aren't those just queries like:

SELECT COUNT(*) FROM table WHERE A = 'helloworld' AND C = 4;
SELECT COUNT(*) FROM table WHERE E = 4 AND C < 3;

If you know beforehand which conditions you want to query, you can do them all in one go - it will result in a sequential scan though:

SELECT SUM((A = 'helloworld' AND C = 4)::int),
       SUM((E = 4 AND C < 3)::int),
       ...
  FROM table;

This makes use of the fact that a boolean result cast to int yields 0 for false and 1 for true.

If the conditions you want to summarise are fixed (never change), then you could store their respective values in a separate column using a bit-field or something of the kind, or in multiple columns with descriptive names (see the summary-table sketch further down for one way to combine this with a nightly job).

> The database however is still quite huge and grows very fast, even simple daily queries are fairly slow even on a fast server. I have a few indexes on what I know are common columns queried against but again, any combination of data can be queried, and indexes do increase db size of course.

Indexes don't help for queries where a sequential scan is used, so you don't need to worry about intricate index designs for such queries. They certainly do help (a lot!) for queries that touch only a minority of the total rows.
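To make that concrete - with names I'm just making up here ("requests" for your raw table) - an index like the following would speed up the first count considerably, provided only a small fraction of the rows actually match:

CREATE INDEX requests_a_c_idx ON requests (A, C);

-- uses the index when few rows have A = 'helloworld' AND C = 4
SELECT COUNT(*) FROM requests WHERE A = 'helloworld' AND C = 4;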
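Here's a rough sketch of the pre-generated concatenation I mentioned above. The table name "requests" and the column name "combined_key" are invented for the example; adjust to taste:

ALTER TABLE requests ADD COLUMN combined_key text;

-- run from a nightly cron-job, after the day's data has arrived;
-- coalesce() keeps a single NULL from turning the whole key NULL
UPDATE requests
   SET combined_key = coalesce(A, '') || coalesce(B::text, '') ||
                      coalesce(C::text, '') || coalesce(D::text, '') ||
                      coalesce(E::text, '')
 WHERE combined_key IS NULL;

-- the total/unique query then only has one expression to evaluate per row
SELECT COUNT(*) AS total, COUNT(DISTINCT combined_key) AS uniques
  FROM requests;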
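And if the set of conditions really is fixed, you could go one step further and have the nightly job fill a small summary table instead, so the daily queries never touch the raw data at all. Again, all names below are made up, and I'm assuming there's some timestamp column (called "ts" here) to select a single day by:

CREATE TABLE daily_summary (
    day            date PRIMARY KEY,
    total          bigint,
    uniques        bigint,
    helloworld_c4  bigint,
    e4_c_lt3       bigint
);

-- run nightly, once the previous day's data is complete
INSERT INTO daily_summary
SELECT current_date - 1,
       COUNT(*),
       COUNT(DISTINCT combined_key),
       SUM((A = 'helloworld' AND C = 4)::int),
       SUM((E = 4 AND C < 3)::int)
  FROM requests
 WHERE ts >= current_date - 1
   AND ts <  current_date;

Queries against daily_summary stay fast no matter how large the raw table grows.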
> I feel like there's got to be some better way to organize this data and make it searchable. Overall speed is more important than disk space usage for this application.

Anything that you can calculate at a convenient time, before you need to query it, is a win to store - as long as querying the results isn't more work than querying the original data.

I've used both cron jobs and triggers for such cases, successfully and unsuccessfully. The risk is that the derived data has to match the data it was derived from, or you end up with wrong results. At one point (at a previous job) we were maintaining a materialised view over data from about ten tables using triggers, but we noticed the results didn't always match what we expected. Figuring out where it went wrong took more time than we could afford, so we ended up using a nightly cron-job instead.

I suppose what I mean to say is: use triggers to pre-calculate data in the simple cases, but prefer cron jobs for the complicated ones. Debugging complicated triggers can be time-consuming.

Regards,

Alban Hertroys

--
Screwing up is the best way to attach something to the ceiling.