I see a lot of problems with this idea.

You mention that the database is supposed to be available 24x7. While you are loading, the table receiving data will not be available. You will therefore have to keep one server online (with only the old data) while the other one is loading; once the load and all procedures are complete, you can switch the active server.

You do not mention your time constraints or the total volume of data. If the new data must become available at some critical time, that is a very important requirement and must be spelled out clearly. You will also need some kind of guarantee of relational integrity, which is always difficult to achieve when bulk loading from foreign sources.

I think it is important to spell things out more clearly:

- How many tables are to be replicated?
- What is the total number of expected rows for each table?
- How fast are the tables expected to grow?
- When must the new data become available online?
- Are all of the tables in the database populated from a foreign source, or just some of them?
- Do you also have access to the source data in its database format, or only as text dumps?
- Is the goal to serve as a reporting server?
- Is the goal to put the same data as on the original server online for end users, and in so doing reduce the load on the original server?
- What is the real purpose of the task to be accomplished?
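On the availability point: one pattern worth considering is to build a replacement table while the live one keeps serving queries, then swap it in with a rename. This is only a sketch of the idea, not a drop-in solution - the table names and file path below are placeholders, and any indexes and constraints would have to be created on the new table before the swap:

    -- Load phase: build the replacement while the live table keeps
    -- serving queries. Server-side COPY needs superuser rights;
    -- psql's \copy is the unprivileged alternative.
    CREATE TABLE main_table_new (LIKE main_table);
    COPY main_table_new FROM '/path/to/datafile.txt';
    -- (create indexes and constraints on main_table_new here)

    -- Swap phase: one short transaction. Readers see the old data
    -- until the commit and the new data immediately afterwards.
    BEGIN;
    ALTER TABLE main_table RENAME TO main_table_old;
    ALTER TABLE main_table_new RENAME TO main_table;
    COMMIT;
    DROP TABLE main_table_old;

Whether that fits depends on the answers to the questions above, particularly on how the foreign keys between the tables are arranged.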
> -----Original Message-----
> From: pgsql-general-owner@xxxxxxxxxxxxxx [mailto:pgsql-general-
> owner@xxxxxxxxxxxxxx] On Behalf Of Ben-Nes Yonatan
> Sent: Monday, June 27, 2005 1:13 PM
> To: pgsql-general@xxxxxxxxxxxxxx
> Subject: Populating huge tables each day
>
> Hi All,
>
> First I apologize for the length of this email; I'm just afraid that my
> problem is quite complicated to explain. I also apologize for my
> English, which is not my native language... sorry :)
>
> I'm currently building a site for a client which will need to update
> his database on a daily routine with about 3.5 million rows from data
> files (the files vary in size and can reach up to 1 million rows per
> file). That data will be retrieved from several files (text or XML, my
> choice). Some of the rows will need to be INSERTed because they don't
> have any previous record, and some will need to UPDATE previous
> records, or just INSERT and DELETE the previous records.
> Besides that, the new data also brings information that tells how to
> populate another table (2 fields, and the result is a lot fewer rows),
> and the two tables are connected by a foreign key which is written in
> each row of the main table (the one with the 3.5 million rows).
> Now the site must work 24/7, and it will probably have 2 servers which
> will run PostgreSQL (working as 1 database); the scripting language
> I'll use is PHP, if that changes anything.
>
> I thought about how to accomplish this and I would like to receive
> comments and ideas. I'm mostly afraid of the stress it will put on the
> server during the process, and the site must display working
> information all the time.
>
> A brief explanation of the definitions I'll use ahead:
> 1. New data - The retrieved data from the files (text or XML).
> 2. Main table - The final table which needs to hold about 3.5 million
> rows. It can be empty before the process, or it can already hold
> information, some of which needs to get updated using the "New data";
> the rest of the "New data" needs to get inserted into it, while the
> previous data which didn't get updated needs to be deleted.
> 3. Second table - The table which holds information that the data in
> the "Main table" needs to be connected to using foreign keys.
> 4. Temp table - A temporary table which will hold the "New data" until
> it is ready to INSERT/UPDATE the data in the "Main table" (it has the
> exact same columns as the "Main table").
> 5. Temp table2 - A temporary table which is created by CREATE TEMPORARY
> TABLE AS (former SELECT INTO).
>
> My plan:
> 1. With a COPY FROM I'll insert the data into the "Temp table" (all of
> the files in text format).
> 2. Run a PL/pgSQL function: {
>
> A. Start transaction.
> B. DELETE the contents of the existing "Second table".
> C. INSERT data into the "Second table" using the "Temp table" - each
> row's values will be compared (with a SELECT) to check whether they
> already exist in the "Second table", and if not, an INSERT will create
> them there - I'm afraid that this process will be extremely heavy on
> the server.
> D. DELETE the "Main table" rows.
> E. With CREATE TEMPORARY TABLE AS I'll create "Temp table2", which will
> contain all of the information of the "Temp table", plus a subquery
> will retrieve the ID from the "Second table" for the foreign key -
> quite a heavy process, I suspect.
> F. INSERT the data from "Temp table2" into the "Main table".
> G. End transaction + quit from PL/pgSQL. }
>
> 3. Delete all the files.
>
> Thanks a lot in advance, and again I'm sorry for the length of the
> mail :)
>
> Ben-Nes Yonatan
> Canaan Surfing ltd.
> http://www.canaan.net.il
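Two comments on the plan above. Step C does not have to check rows one at a time: a single set-based INSERT can add only the values that are not already present. Likewise, step E's per-row subquery can be written as a join. A rough sketch, assuming a hypothetical schema second_table(id, val) and a staging column temp_table.val, since the real column names were not given:

    -- Step C as one statement: insert each distinct value that does
    -- not already exist in second_table.
    INSERT INTO second_table (val)
    SELECT DISTINCT t.val
    FROM temp_table t
    WHERE NOT EXISTS
          (SELECT 1 FROM second_table s WHERE s.val = t.val);

    -- Step E as a join: attach the matching second_table id to every
    -- staged row in one pass.
    CREATE TEMPORARY TABLE temp_table2 AS
    SELECT t.*, s.id AS second_table_id
    FROM temp_table t
    JOIN second_table s ON s.val = t.val;

An index on second_table(val) should keep both statements from degenerating into repeated sequential scans; whether this actually beats a row-by-row loop is worth testing on realistic data volumes.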