I have some data in the form of a matrix of doubles (~2 million
rows, ~400 columns) that I would like to store in a Pg table,
along with the associated table of metadata (same number of rows,
~30 columns, almost all text). This is large enough to make
working with it from flat files unwieldy.
(The only reason for wanting to transfer this data to a Pg table
is the hope that it will be easier to work with it by using SQL
queries to extract subsets of it, than by, e.g., writing programs
that must scan the entire matrix every time they're run.)
My plan is to add one extra column to the Pg tables to serve as
the primary key joining the data and the metadata tables
together.
I could be wrong, but it seems to me that the records in the two files have a one to one correspondence with each other. That is, record #1 in file #1 has data related to record #1 in file #2. And that there should therefore always be the same number of records in both files. So my first thought is why not just have one table in PostgreSQL which contains the data merged from the corresponding records in the two files? IMO, this is a cleaner data design than trying to ensure that two tables are kept "in sync" as records are added and deleted. Of course, this could possible hurt performance, due to reading a lot of data, if you consistently only need data from one file and only occasionally read the other.
If you insist on two for some reason, then I would have the load program read a record from each file, incrementing a counter, and INSERT the data from the files into the tables using the counter as the primary key. To ease things a bit, assuming the 1:1, I would have the PK on the more active file be a foreign key on the other file, with an ON CASCADE DELETE so that when you DELETE records from the more active, the corresponding record in the secondary table will automatically be deleted. You might even want some sort of CONTRAINT trigger to do a DELETE on the primary table if a record is DELETEd from the secondary.
Also, if you really want to separate the data into more than one table, then why only two? In this case, which I admit that I personally _dislike_, you might want to do an analysis of what columns you access together frequently and put those columns into separate tables. Again, the main plus of this would be memory usage and I/O on the data base server side. May Joe Celko forgive me for introducing physical DB configuration into a data structure discussion.
<snip>
Thanks in advance!
kj
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
If you sent twitter messages while exploring, are you on a textpedition?
He's about as useful as a wax frying pan.
10 to the 12th power microphones = 1 Megaphone
Maranatha! <><
John McKown
He's about as useful as a wax frying pan.
10 to the 12th power microphones = 1 Megaphone
Maranatha! <><
John McKown