optimizing daily data storage in Pg

P Kishor <punk.kish@xxxxxxxxx> · Thu, 22 Jul 2010 09:41:12 -0500

I have been struggling with this for a while now, have even gone down
a few paths but struck out, so I turn now to the community for ideas.
First, the problem: Store six daily variables for ~ 25 years for cells
in a grid.

 * Number of vars = 6
 * Number of cells ~ 13 million
 * Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries:

Query one: Retrieve the value of a single var for all or a portion of
the cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

Query two: Retrieve values for all the days or a duration of days for
a single var for a single cell. This is like grabbing a column out of
a table in which each row holds all the vars for a single day.

So, I set about designing the db. The "grid" is in a table with 13 million rows

    CREATE TABLE cells (
        cell_id      INTEGER,
        other_data ..
    )
    WITH (
        OIDS=FALSE
    )

A single table *where every row is one day's values for one cell* looks like so

    CREATE TABLE d (
        yr      SMALLINT,
        yday    SMALLINT,
        a       SMALLINT,
        b       SMALLINT,
        d       SMALLINT,
        e       SMALLINT,
        f       SMALLINT,
        g       SMALLINT,
        cell_id INTEGER
    )
    WITH (
        OIDS=FALSE
    )

The data would look like so

    yr      yday    a   b   c   d   e   f   g   cell_id
    ----------------------------------------------------
    1980    1       x   x   x   x   x   x   x   1
    ..
    1980    365     x   x   x   x   x   x   x   1
    ...
    1981    1       x   x   x   x   x   x   x   1
    ..
    1981    365     x   x   x   x   x   x   x   1
      ...
      ...
    2005    1       x   x   x   x   x   x   x   1
    ..
    2005    365     x   x   x   x   x   x   x   1
    ......
    1980    1       x   x   x   x   x   x   x   2
    ..
    1980    365     x   x   x   x   x   x   x   2
      ...

I could now (theoretically) conduct my queries like so:

Query 1a: Retrieve the value of a single var for all the cells for a
single day. This is analogous to an image where every pixel is the
value of a single var.

    SELECT <var> FROM d WHERE yr = ? AND yday = ?;

I assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday).

Query 1b: Retrieve the value of a single var for a portion of the
cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

    SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);

I assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday) AND an index on cell_id.

Query 2: Retrieve values for all the days or a duration of days for a
single var for a single cell. This is like grabbing a column out of a
table in which each row holds all the vars for a single day.

    SELECT <var> FROM d WHERE cell_id = ?;
    SELECT <var> FROM d WHERE cell_id IN (?,?,?...);

Once again, an index on cell_id would assist in the above.

The problem: The above table would have 13 M * 9125 rows ~ 118 billion
rows. Huge indexes, slow queries, etc. In fact, major issues loading
the data in the first place. Since I am loading data in batches, I
drop the indexes (takes time), COPY data into the table (takes time),
build the indexes (takes time), experiment with improving the
performance (takes time), fail, rinse, lather, repeat. I actually
tried the above with a subset of data (around 100 M rows) and
experienced all of the above. I don't remember the query times, but
they were awful.

So, I partitioned the table into years like so

    CREATE TABLE d_<yyyy> (
        CHECK ( yr = <yyyy> )
    ) INHERITS (d)

Hmmm... still no satisfaction. I ended up with 1 master table + 25
inherited tables. Each of the year tables now had ~ 4.75 billion rows
(13 M * 365), and the queries were still very slow.

So, I partitioned it all by years and days like so

    CREATE TABLE d_<yyyy>_<yday> (
        CHECK ( yr = <yyyy> AND yday = <yday> )
    ) INHERITS (d)

Each table now has 13 million rows, and is reasonably fast (although
still not satisfactorily fast), but now I have 9K tables. That has its
own problems. I can't query the master table anymore as Pg tries to
lock all the tables and runs out of memory. Additionally, I can't
anymore conduct query two above. I could do something like

    SELECT a FROM d_1980_1 WHERE cell_id = 1
    UNION
    SELECT a FROM d_1980_2 WHERE cell_id = 1
    UNION
    SELECT a FROM d_1980_3 WHERE cell_id = 1
    UNION
    SELECT a FROM d_1980_4 WHERE cell_id = 1
    UNION
    ...

But the above is hardly optimal.

Any suggestions, ideas, brainstorms would be appreciated. Perhaps Pg,
or even a RDBMS, is not the right tool for this problem, in which
case, suggestion for alternatives would be welcome as well.

Right now I am testing this on a dual Xeon dual core 3 GHz Xserve with
12 GB RAM. The PGDATA directory is located on an attached RAID that is
configured as RAID5. Reasonable time for a query would be under 500
ms, although ultimately I would love to have the query be done under
250 ms, perhaps with RAID10, and a machine with more RAM. I have
access to a machine with dual Xeon quad core 3 GHz Xserve with 32 GB
RAM, and an internal RAID, but before I try to move the data, I want
to actually conclusively prove that Pg is the best solution (or not).
Perhaps flat files are better, perhaps blobs are better, or perhaps
Pg's array column type. Any ideas/suggestions welcome.

Oh, one more thing. Once the data are loaded and everything is
working, the data are readonly. They are historical, so they don't
need to be changed.

-- 
Puneet Kishor

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general