On 09/10/13 21:05, Alban Hertroys wrote:
> On Oct 9, 2013, at 4:08, Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
>> Toby Corkindale <toby.corkindale@xxxxxxxxxxxxxxxxxxxx> wrote:
>>> In this instance, we have a lot of queries that build certain
>>> aggregate results, which are very slow. The queries were
>>> initially all implemented as views, but then we started doing a
>>> type of materialising of our own, turning them into tables with
>>> CREATE TABLE AS SELECT .... This does make the results very fast
>>> to access now, but the side effect is a vast number of (very
>>> small) tables.
>> If you have multiple tables with identical layout but different
>> subsets of the data, you will probably get better performance by
>> putting them into a single table with indexes which allow you to
>> quickly search the smaller sets within the table.
> I was thinking just that while reading Toby's message. For example,
> you could put the results of several related aggregations into a
> single materialized view, if they share the same key columns (year,
> month, factory or something similar). I'm not sure the new built-in
> materialized views can be updated like that though, unless you manage
> to combine those aggregations into a single monster-query, but that
> will probably not perform well... What we tend to do at work (no
> PostgreSQL, unfortunately) is to use external tools to combine those
> aggregated results and store that back into the database (which we
> often need to do anyway, as we deal with several databases on several
> servers).
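Combining several of our related aggregations into one materialized
view, as Alban suggests, would presumably look something like the
sketch below; all of the table and column names here are invented, and
I haven't tried this on 9.3:

create materialized view combined_results as
select id_key_1, id_key_2,
       avg(foo_x) as FooResultX,   -- hypothetical aggregates
       sum(bar_y) as BarResultY
from source_data                   -- hypothetical source table
group by id_key_1, id_key_2;

-- rebuilt periodically; 9.3 locks the view against reads while this runs:
refresh materialized view combined_results;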
Thanks for the suggestions, all.
As I noted in an earlier email, we're aware that the schema could be
better designed, but making large changes is tricky in production systems.
Many of the tables do have unique layouts, but only in the sense that
each one has the same (common) identifier fields, followed by a few
(unique) aggregate-result columns.
e.g.:
int id_key_1, int id_key_2, .., float FooBarXResult
I suspect the correct way to handle this would actually be a table that
looked like:
int id_key_1, int id_key_2, .., text result_name, float result_value
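Spelled out as DDL, I imagine that would look something like the
sketch below. The primary key is a guess on my part, and the elided
identifier columns are still elided:

create table AllResults (
    id_key_1     int,
    id_key_2     int,
    -- .., the other identifier columns go here
    result_name  text,
    result_value float,
    -- the key would need to cover all the identifier columns too
    primary key (id_key_1, id_key_2, result_name)
);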
Although that would in turn make other queries more verbose. For
example, currently one can do:
select *
from FooResult
join BarResult using (id_key_1, id_key_2)
where FooResultX > 0.9 and BarResultY < 0.1;
I guess that turns into something like this:
select id_key_1, id_key_2,
       a.result_value as FooResultX,
       b.result_value as BarResultY
from AllResults a
join AllResults b using (id_key_1, id_key_2)
where a.result_name = 'FooResultX'
  and a.result_value > 0.9
  and b.result_name = 'BarResultY'
  and b.result_value < 0.1;
So it's all do-able, but it does look nicer to separate things into
their own tables with named columns.
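Following Kevin's indexing suggestion, I'd expect a couple of indexes
along these lines to keep that self-join quick; untested, so just a
sketch against the hypothetical AllResults table above:

-- lets the planner find one named result and its join keys quickly:
create index allresults_name_keys_idx
    on AllResults (result_name, id_key_1, id_key_2);

-- helps range filters like result_value > 0.9 for a given name:
create index allresults_name_value_idx
    on AllResults (result_name, result_value);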
> Additionally, if you have that many tables, it sounds like you
> partitioned your data. With aggregated results, the need for
> partitioning is much less (or perhaps it isn't even needed at all).
> And perhaps you don't even need the data from all partitions; say if
> you have monthly partitions of data, do you really need aggregated
> results from 5 years ago?
You're correct, we do have partitioned tables due to the amount of data
in the system, but that's just for the non-aggregated data.
Those tables perform just fine!
It's the hundreds of thousands of views and tables with just a few rows
in them that worry me... :)
> That said, users excel in finding data to request that you thought
> they wouldn't need.
> Which brings me to another question: Do your users really need the
> data from all those views or do they only think they need that?
Ah, indeed, users have not individually requested each of these many
thousands of tables and views. They are part of a large application, and
the results from all of those are required by it.
If I rewrote the application today, I'd be looking at doing things very
differently, knowing how it would eventually scale.