I see a lot of problems with this idea.

You mention that the database is supposed to be available 24x7. While you are loading, the table receiving data will not be available. You will therefore have to keep one server online (with only the old data) while the other one is loading; once the load and all procedures are complete, you can switch the active server.

You do not mention your time constraints or the total volume of data. If the new data must become available at some critical time, that is a very important requirement and must be spelled out clearly. You will also need some kind of guarantee of relational integrity, which is always difficult to achieve when bulk loading from foreign sources.

I think it is important to spell things out more clearly:

- How many tables are to be replicated?
- What is the total number of expected rows for each table?
- How fast are the tables expected to grow?
- When must the new data become available online?
- Are all of the tables in the database populated from a foreign source, or just some of them?
- Do you also have access to the source data in its database format, or only as text dumps?
- Is the goal to serve as a reporting server?
- Is the goal to put the same data as on the original server online for end users, and in so doing reduce the load on the original server?
- What is the real purpose of the task to be accomplished?
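On the availability point: one pattern worth considering is to build a replacement table while the live one keeps serving queries, then swap it in with a rename. This is only a sketch of the idea, not a drop-in solution - the table names and file path below are placeholders, and any indexes and constraints would have to be created on the new table before the swap:

    -- Load phase: build the replacement while the live table keeps
    -- serving queries. Server-side COPY needs superuser rights;
    -- psql's \copy is the unprivileged alternative.
    CREATE TABLE main_table_new (LIKE main_table);
    COPY main_table_new FROM '/path/to/datafile.txt';
    -- (create indexes and constraints on main_table_new here)

    -- Swap phase: one short transaction. Readers see the old data
    -- until the commit and the new data immediately afterwards.
    BEGIN;
    ALTER TABLE main_table RENAME TO main_table_old;
    ALTER TABLE main_table_new RENAME TO main_table;
    COMMIT;
    DROP TABLE main_table_old;

Whether that fits depends on the answers to the questions above, particularly on how the foreign keys between the tables are arranged.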
> -----Original Message-----
> From: pgsql-general-owner@xxxxxxxxxxxxxx [mailto:pgsql-general-
> owner@xxxxxxxxxxxxxx] On Behalf Of Ben-Nes Yonatan
> Sent: Monday, June 27, 2005 1:13 PM
> To: pgsql-general@xxxxxxxxxxxxxx
> Subject: Populating huge tables each day
>
> Hi All,
>
> First I apologize for the length of this email; I'm just afraid that my
> problem is quite complicated to explain. I also apologize for my
> English, which is not my native language... sorry :)
>
> I'm currently building a site for a client which will need to update
> his database on a daily routine with about 3.5 million rows from data
> files (the files vary in size and can reach up to 1 million rows per
> file). That data will be retrieved from several files (text or XML, my
> choice). Some of the rows will need to be INSERTed because they don't
> have any previous record, and some will need to UPDATE previous
> records, or just INSERT and DELETE the previous records.
> Besides that, the new data also brings information that tells how to
> populate another table (2 fields, and the result is a lot fewer rows),
> and the two tables are connected by a foreign key which is written in
> each row of the main table (the one with the 3.5 million rows).
> Now the site must work 24/7, and it will probably have 2 servers which
> will run PostgreSQL (working as 1 database); the scripting language
> I'll use is PHP, if that changes anything.
>
> I thought about how to accomplish this and I would like to receive
> comments and ideas. I'm mostly afraid of the stress it will put on the
> server during the process, and the site must display working
> information all the time.
>
> A brief explanation of the definitions I'll use ahead:
> 1. New data - The retrieved data from the files (text or XML).
> 2. Main table - The final table which needs to hold about 3.5 million
> rows. It can be empty before the process, or it can already hold
> information, some of which needs to get updated using the "New data";
> the rest of the "New data" needs to get inserted into it, while the
> previous data which didn't get updated needs to be deleted.
> 3. Second table - The table which holds information that the data in
> the "Main table" needs to be connected to using foreign keys.
> 4. Temp table - A temporary table which will hold the "New data" until
> it is ready to INSERT/UPDATE the data in the "Main table" (it has the
> exact same columns as the "Main table").
> 5. Temp table2 - A temporary table which is created by CREATE TEMPORARY
> TABLE AS (former SELECT INTO).
>
> My plan:
> 1. With a COPY FROM I'll insert the data into the "Temp table" (all of
> the files in text format).
> 2. Run a PL/pgSQL function: {
>
> A. Start transaction.
> B. DELETE the contents of the existing "Second table".
> C. INSERT data into the "Second table" using the "Temp table" - each
> row's values will be compared (with a SELECT) to check whether they
> already exist in the "Second table", and if not, an INSERT will create
> them there - I'm afraid that this process will be extremely heavy on
> the server.
> D. DELETE the "Main table" rows.
> E. With CREATE TEMPORARY TABLE AS I'll create "Temp table2", which will
> contain all of the information of the "Temp table", plus a subquery
> will retrieve the ID from the "Second table" for the foreign key -
> quite a heavy process, I suspect.
> F. INSERT the data from "Temp table2" into the "Main table".
> G. End transaction + quit from PL/pgSQL. }
>
> 3. Delete all the files.
>
> Thanks a lot in advance, and again I'm sorry for the length of the
> mail :)
>
> Ben-Nes Yonatan
> Canaan Surfing ltd.
> http://www.canaan.net.il
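Two comments on the plan above. Step C does not have to check rows one at a time: a single set-based INSERT can add only the values that are not already present. Likewise, step E's per-row subquery can be written as a join. A rough sketch, assuming a hypothetical schema second_table(id, val) and a staging column temp_table.val, since the real column names were not given:

    -- Step C as one statement: insert each distinct value that does
    -- not already exist in second_table.
    INSERT INTO second_table (val)
    SELECT DISTINCT t.val
    FROM temp_table t
    WHERE NOT EXISTS
          (SELECT 1 FROM second_table s WHERE s.val = t.val);

    -- Step E as a join: attach the matching second_table id to every
    -- staged row in one pass.
    CREATE TEMPORARY TABLE temp_table2 AS
    SELECT t.*, s.id AS second_table_id
    FROM temp_table t
    JOIN second_table s ON s.val = t.val;

An index on second_table(val) should keep both statements from degenerating into repeated sequential scans; whether this actually beats a row-by-row loop is worth testing on realistic data volumes.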