On Mon, Jun 1, 2009 at 1:25 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
After my original post, I found out that the current solution in SQL Server actually had to be split into two different tables due to a similar limitation.
You're right, I guess there doesn't seem to be much point having that surrogate key in place...and it does take up space. I did a quick test and got the following figures:
1 million records with DataImportID = ~80mb
1 million records without DataImportID = ~50mb.
That'll certainly add up over time.
I'll give it a try.
Re-reading this I noticed that I had an extra 0 in one of my figures - there's only 1440 minutes in a day, so my table gets down to the much more manageable 2.16 million records per day instead of 21.6 million. I'll have a look into partitioning too, thanks for your help!
Cheers,
Andrew
Andrew Smith <laconical@xxxxxxxxx> writes:So the table would have ~1500 columns? You definitely don't want to do
> I'm a beginner when it comes to Postgresql, and have a table design question
> about a project I'm currently working on. I have 1500 data items that need
> to be copied every minute from an external system into my database. The
> items have a timestamp, an identifier and a value. For example:
> 12/10/2008 05:00 ID_ABC 14
> 12/10/2008 05:01 ID_ABC 17
> 12/10/2008 05:02 ID_ABC 13
> Pretty simple stuff. The current solution (implemented using SQL Server a
> few years ago) looks like this (an approximation using Postgresql syntax):
> CREATE TABLE "DataImport"
> (
> "DataImportID" serial NOT NULL PRIMARY KEY,
> "Time" timestamp without time zone NOT NULL,
> "ID_ABC" integer NOT NULL,
> "ID_DEF" integer NOT NULL,
> "ID_HIJ" integer NOT NULL,
> etc
> );
it that way in Postgres --- you'd be way too close to the maximum column
count limitation.
After my original post, I found out that the current solution in SQL Server actually had to be split into two different tables due to a similar limitation.
DataImportID couldn't be a primary key here, could it? Or is it just
> My initial thought for the design of the new solution looks like this:
> CREATE TABLE "DataImport"
> (
> "DataImportID" serial NOT NULL PRIMARY KEY,
> "Time" timestamp without time zone NOT NULL,
> "Identifier" text NOT NULL,
> "Value" integer NOT NULL
> );
meant as an artificial primary key? If so, consider not bothering with
it at all --- (Time, Identifier) seems like a perfectly good natural
key, and with such short records the extra space for a serial column
is not negligible.
You're right, I guess there doesn't seem to be much point having that surrogate key in place...and it does take up space. I did a quick test and got the following figures:
1 million records with DataImportID = ~80mb
1 million records without DataImportID = ~50mb.
That'll certainly add up over time.
Anyway, my answer would definitely lean towards using this normalized
representation, if all the data values are integers. (If they're not,
it gets messier...)
An index on (Identifier, Time) (not the other way around) should work
> Users will then be doing regular queries on this data (say, a few hundred
> times per day), such as:
> SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND
> "Time" between '2008-11-07' and '2008-11-11';
well for that type of query.
I'll give it a try.
That's not an especially big table. However it might be worth your
> My concern is that 1500 values * 14400 minutes per day = 21,600,000
> records. Add this up over the course of a month (the length of time I need
> to keep the data in this table) and I'll have more than half a billion
> records being stored in there.
trouble to use partitioning. Not to speed searches, particularly, but
to make it easier to drop 1/30th of the data each day.
Re-reading this I noticed that I had an extra 0 in one of my figures - there's only 1440 minutes in a day, so my table gets down to the much more manageable 2.16 million records per day instead of 21.6 million. I'll have a look into partitioning too, thanks for your help!
Cheers,
Andrew