Andrew Smith <laconical@xxxxxxxxx> writes:
> I'm a beginner when it comes to Postgresql, and have a table design question
> about a project I'm currently working on.  I have 1500 data items that need
> to be copied every minute from an external system into my database.  The
> items have a timestamp, an identifier and a value.  For example:

> 12/10/2008 05:00  ID_ABC  14
> 12/10/2008 05:01  ID_ABC  17
> 12/10/2008 05:02  ID_ABC  13

> Pretty simple stuff.  The current solution (implemented using SQL Server a
> few years ago) looks like this (an approximation using Postgresql syntax):

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "ID_ABC" integer NOT NULL,
>   "ID_DEF" integer NOT NULL,
>   "ID_HIJ" integer NOT NULL,
>   etc
> );

So the table would have ~1500 columns?  You definitely don't want to do it
that way in Postgres --- you'd be way too close to the maximum column count
limitation.

> My initial thought for the design of the new solution looks like this:

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "Identifier" text NOT NULL,
>   "Value" integer NOT NULL
> );

DataImportID isn't a natural primary key here, is it?  Or is it just meant
as an artificial one?  If so, consider not bothering with it at all ---
(Time, Identifier) seems like a perfectly good natural key, and with such
short records the extra space for a serial column is not negligible.

Anyway, my answer would definitely lean towards using this normalized
representation, if all the data values are integers.  (If they're not,
it gets messier...)

> Users will then be doing regular queries on this data (say, a few hundred
> times per day), such as:

> SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND
> "Time" between '2008-11-07' and '2008-11-11';

An index on (Identifier, Time) (not the other way around) should work well
for that type of query.

> My concern is that 1500 values * 1440 minutes per day = 2,160,000 records.
> Add this up over the course of a month (the length of time I need to keep
> the data in this table) and I'll have more than 60 million records being
> stored in there.

That's not an especially big table.  However it might be worth your trouble
to use partitioning.  Not to speed searches, particularly, but to make it
easier to drop 1/30th of the data each day.

			regards, tom lane
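
For concreteness, here is roughly what the normalized table looks like once
the serial column is dropped and (Time, Identifier) is promoted to the
primary key, as suggested above.  This is only a sketch: it keeps the quoted
column names from the original post and assumes the values really are all
integers.

CREATE TABLE "DataImport"
(
  "Time"       timestamp without time zone NOT NULL,
  "Identifier" text    NOT NULL,
  "Value"      integer NOT NULL,
  PRIMARY KEY ("Time", "Identifier")
);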
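
The index recommended for the sample query could be created as below (the
index name is illustrative).  Note that if the primary key above were
instead declared with "Identifier" first, its underlying index would already
serve this query and the extra index would be unnecessary.

CREATE INDEX "DataImport_Identifier_Time_idx"
    ON "DataImport" ("Identifier", "Time");

-- the sample query from the original post, which can use the index above
SELECT "Time", "Value"
  FROM "DataImport"
 WHERE "Identifier" = 'ID_ABC'
   AND "Time" BETWEEN '2008-11-07' AND '2008-11-11';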
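
As for partitioning by day so that old data can be dropped cheaply: at the
time of this thread that meant inheritance plus constraint exclusion, but on
a current release (PostgreSQL 11 or later, which also allows the primary key
shown here on a partitioned table) declarative range partitioning is the
simpler route.  A minimal sketch, with illustrative partition names:

-- replaces the non-partitioned CREATE TABLE above
CREATE TABLE "DataImport"
(
  "Time"       timestamp without time zone NOT NULL,
  "Identifier" text    NOT NULL,
  "Value"      integer NOT NULL,
  PRIMARY KEY ("Time", "Identifier")   -- must include the partition column
) PARTITION BY RANGE ("Time");

-- one partition per day, created ahead of time (e.g. by a scheduled job)
CREATE TABLE "DataImport_2008_11_07" PARTITION OF "DataImport"
    FOR VALUES FROM ('2008-11-07') TO ('2008-11-08');

-- dropping a day's worth of data is then a cheap metadata operation,
-- avoiding a bulk DELETE and the vacuuming that would follow it
DROP TABLE "DataImport_2008_11_07";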