On 7/23/2010 12:39 AM, P Kishor wrote:
On Thu, Jul 22, 2010 at 4:56 PM, Andy Colson <andy@xxxxxxxxxxxxxxx> wrote:
On 7/22/2010 9:41 AM, P Kishor wrote:
I have been struggling with this for a while now, have even gone down
a few paths but struck out, so I turn now to the community for ideas.
First, the problem: Store six daily variables for ~ 25 years for cells
in a grid.
* Number of vars = 6
* Number of cells ~ 13 million
* Number of days ~ 9125 (25 * 365)
Optimize the store for two different kinds of queries:
Query one: Retrieve the value of a single var for all or a portion of
the cells for a single day. This is analogous to an image where every
pixel is the value of a single var.
SELECT <var> FROM d WHERE yr = ? AND yday = ?;
SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);
Query two: Retrieve values for all the days or a duration of days for
a single var for a single cell. This is like grabbing a column out of
a table in which each row holds all the vars for a single day.
SELECT <var> FROM d WHERE cell_id = ?;
SELECT <var> FROM d WHERE cell_id IN (?,?,?...);
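A minimal sketch of the kind of table those queries imply, for reference; the column names var1..var6, the data types, and the index name here are purely illustrative assumptions:

    CREATE TABLE d (
        cell_id integer  NOT NULL,
        yr      smallint NOT NULL,
        yday    smallint NOT NULL,
        var1 real, var2 real, var3 real,
        var4 real, var5 real, var6 real,
        PRIMARY KEY (cell_id, yr, yday)
    );

    -- Query one filters on (yr, yday), optionally narrowing by cell_id.
    CREATE INDEX d_day_idx ON d (yr, yday);

    -- Query two filters on cell_id alone; the primary key above already
    -- covers it, since cell_id is its leading column.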
First, I must admit to not reading your entire email.
I am not sure how to respond to your feedback given that you haven't
read the entire email. Nevertheless, thanks for writing...
Heh, sorry, my point was, you put a lot of information into your email,
and I was going to only use one bit of it: row counts.
Second, Query 1 should be fast, regardless of how you lay out the tables.
It is not fast. Right now I have data for about 250,000 cells loaded.
That comes to circa 92 million rows per year. Performance is pretty
sucky.
This query should return one record, correct? This should be very fast:
PG should be able to find the record in the index within 5 seeks, and
then find the data in one seek. Can you post 'explain analyze' for this
query? (Or it could be the case that I totally misunderstood your data.)
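For the single-cell form of query one, such a run might look like this (using the sketch table above; the literal values are placeholders):

    EXPLAIN ANALYZE
    SELECT var1
    FROM d
    WHERE yr = 2005 AND yday = 180 AND cell_id = 12345;

The thing to look for in the output is whether the plan shows an index scan (a fast point lookup) or a sequential scan over the whole table.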
Third, Query 2 will return 13M rows? I don't think it matters how you lay out
the tables; returning 13M rows is always going to be slow.
Yes, I understand that. In reality I will never get 13M rows. For
display purposes, I will probably get around 10,000 to 50,000 rows.
When more rows are needed, it will be to feed a model, so that can be
done offline (without an impatient human being waiting on the other
end).
Right now, my main problem is that I have either too many rows (~4B
rows) in a manageable number of tables (25 tables) or a manageable
number of rows (~13M rows) in too many tables (~9000 tables).
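To make the 25-table layout concrete: with the inheritance-style partitioning PostgreSQL offered at the time, each yearly table can sit behind a single parent so the queries keep their one-table form. A sketch with hypothetical names, assuming the parent table d from the earlier sketch:

    -- One child table per year; the CHECK constraint lets the planner
    -- skip non-matching years when constraint_exclusion is enabled.
    CREATE TABLE d_1985 (CHECK (yr = 1985)) INHERITS (d);

    CREATE INDEX d_1985_day_idx  ON d_1985 (yr, yday, cell_id);  -- query one
    CREATE INDEX d_1985_cell_idx ON d_1985 (cell_id);            -- query two

    -- Repeat for each year, loading each year's rows into its own child.

This doesn't change the total row count, but it keeps each physical table and its indexes down to one year's worth of rows.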
Searching by just cell_id is not going to be very selective, and with
large result sets I can see this one being slow. As Scott talked about
in his response, this one will come down to hardware. Have you
dd-tested your hardware?
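A basic sequential-throughput check with dd might look like the following; the file path and size are arbitrary placeholders:

    # Write 1 GB and flush it to disk before reporting the MB/s figure.
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
    rm /tmp/ddtest

The conv=fdatasync option makes dd flush before printing the rate, so the number reflects disk speed rather than the OS write cache.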
-Andy