Search Postgresql Archives

Re: questions on (parallel) COPY and when to REINDEX

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:

Both of the big tables (now up to > 15 M rows each) have
indexes on several of the columns.  The indexes were
created using CREATE INDEX CONCURRENTLY...
Both tables have one or two foreign key constraints.

Others have answered some of your questions already, so I snipped a few.

My questions are:
(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?

No need, but as people are querying as soon as data is arriving, an analyse of the table you just copied to should help performance - the query plans will be derived from more accurate table statistics that way.

(3) Currently the pipeline executes in serial fashion.  We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies
of a script?  Our initial attempt at doing this failed.  I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.

As long as you're not using the same database-session in parallel you should be fine. You can't do parallel database operations in the same session. Last time I did something similar I used separate database connections.

(5) If I drop the indexes and foreign key constraints, then is it
possible to COPY to a table from more than one script, i.e., do
parallel COPY?  It seems like a really bad idea to drop those
foreign key constraints.

You can COPY in parallel, but having no FK's does help insert- performance. In that case whether you should or shouldn't remove your FK's depends on what's more important to you; insert performance or data correctness.

As some of your users query the data while it's still coming in I guess that data correctness is in fact more important to you and you should keep the FK's.

You wrote that your pipeline runs for a period of 4 hours and the table is about 15M rows now. What is taking up all that time? I understand why you'd want to parallelise that process, but where do you expect the gains?

From the above I'm guessing that part of the problem you want to solve by parallelising is insert-performance. In cases like these I've seen people with problems with the look-up speed of foreign keys because the database chose a bad query plan. Regular analyses during inserting data should prevent that.

Should re-think about where our database loading fits into the
overall pipeline, i.e., do the data processing and analysis in
parallel, but keep the data loading sequential?  The reason for
not doing all of the data loading at the end is that some of the
users *really* want to start querying the data and analysis
results as soon as they enter the database.

You seem to have a design where the database grows indefinitely? Your database doesn't look to be very big yet, but I expect that at some point in the future your data will grow so much that the ordering of it on disk starts to matter for how fast records can be fetched. That's a typical scenario where people here start to advise using table partitioning.

The thing is though that partitioning only works well if the queries your users perform contain clauses of which the database knows they divide the data (the same clauses used to partition the table).

The people you work with apparently are knowledgeable enough that they create their own queries. They will have to take partition constraints into account too if you choose to use table partitioning.

Looking forward to your replies.

Janet

Regards,
Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4a7581ec10134875916639!



--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux