On 11/1/07, Kynn Jones <kynnjo@xxxxxxxxx> wrote:
> Hi. This is a recurrent problem that I have not been able to find a
> good solution for. I have a large database that needs to be built
> from scratch roughly once every month. I use a Perl script to do
> this.
>
> The tables are very large, so I avoid as much as possible using
> in-memory data structures, and instead I rely heavily on temporary
> flat files.
>
> The problem is the population of tables that refer to "internal" IDs
> on other tables. By "internal" I mean IDs that have no meaning
> external to the database; they exist only to enable relational
> referencing. They are always defined as serial integers. So the
> script either must create and keep track of them, or it must populate
> the database in stages, letting Pg assign the serial IDs, and query
> the database for these IDs during subsequent stages.

If it is possible, perhaps you could load the "raw" data into a
temporary table and then create the IDs from it. For instance:

CREATE TEMP TABLE foo_raw (host text, city text, who text, value int);

INSERT INTO hosts (host)
  SELECT DISTINCT host FROM foo_raw;  -- group by perhaps?
INSERT INTO [...]

INSERT INTO foo
  SELECT host_id, city_id, who_id, value
  FROM foo_raw
  JOIN hosts  USING (host)
  JOIN cities USING (city)
  JOIN who    USING (who);

This may or may not work, depending on your setup.

Perhaps a better approach, though it needs more work, would be this:
your script establishes two DB connections, one for processing data
and one for maintaining IDs. Then, whenever you need to get an ID:

1) query memcached; if found, return it
2) query the database; if found, return it and insert it into memcached
3) insert into the database, insert into memcached, and perhaps commit it

Befriend thyself with the Cache::* Perl modules. :)

Regards,
   Dawid
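
To round out the first approach: the raw data still has to get into
foo_raw somehow, and since the poster already works from flat files,
COPY is the natural fit. Below is a minimal sketch in Perl, assuming
a tab-separated file at the hypothetical path /tmp/foo.tsv,
placeholder connection parameters, and DBD::Pg 2.0+ for its
pg_putcopydata/pg_putcopyend COPY interface:

  use strict;
  use warnings;
  use DBI;

  # Placeholder DSN and credentials; adjust for the real database.
  my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                         { RaiseError => 1, AutoCommit => 0 });

  $dbh->do('CREATE TEMP TABLE foo_raw ' .
           '(host text, city text, who text, value int)');

  # Stream the flat file through COPY ... FROM STDIN; the default
  # COPY text format expects tab-separated columns, one row per line.
  $dbh->do('COPY foo_raw FROM STDIN');
  open my $fh, '<', '/tmp/foo.tsv' or die "open: $!";
  while (my $line = <$fh>) {
      $dbh->pg_putcopydata($line);
  }
  close $fh;
  $dbh->pg_putcopyend;

  $dbh->commit;

The FROM STDIN form streams from the client, which avoids the
superuser-and-server-readable-path requirements of the server-side
COPY foo_raw FROM '/file' form.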
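
And a sketch of the three-step get-or-create lookup from the second
approach, showing only the ID-maintenance connection. The hosts table,
host_id column, memcached address, and credentials are illustrative
assumptions; INSERT ... RETURNING needs PostgreSQL 8.2 or later (on
older servers, fetch currval('hosts_host_id_seq') after the INSERT
instead):

  use strict;
  use warnings;
  use DBI;
  use Cache::Memcached;

  my $dbh  = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                          { RaiseError => 1, AutoCommit => 0 });
  my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

  # Get-or-create the internal ID for a host, caching it in memcached.
  sub host_id {
      my ($host) = @_;

      # 1) query memcached; if found, return it
      my $id = $memd->get("host:$host");
      return $id if defined $id;

      # 2) query the database; if found, cache it and return it
      ($id) = $dbh->selectrow_array(
          'SELECT host_id FROM hosts WHERE host = ?', undef, $host);

      # 3) otherwise insert, letting Pg assign the serial ID
      if (!defined $id) {
          ($id) = $dbh->selectrow_array(
              'INSERT INTO hosts (host) VALUES (?) RETURNING host_id',
              undef, $host);
          $dbh->commit;  # perhaps commit, so the row is visible elsewhere
      }

      $memd->set("host:$host", $id);
      return $id;
  }

A unique constraint on hosts(host) is worth having here: if two loader
processes ever race on the same key, the second INSERT fails cleanly
instead of creating a duplicate ID.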