Thanks for your reply. Responses below, and one follow-up
question about when/how often to use analyze.
Janet
On 02/08/2009 05:09 a.m., Alban Hertroys wrote:
On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
My questions are:
(2) Should I REINDEX these two tables daily after the pipeline
completes? Is this what other people do in practice?
No need, but since people are querying as soon as data arrives, an
ANALYZE of the table you just copied into should help performance - the
query plans will be derived from more accurate table statistics that way.
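For example, something along these lines right after each load (the
table and file names here are just made-up placeholders):

    COPY observations FROM '/data/batch_0001.dat';
    ANALYZE observations;

ANALYZE on a single table is usually quick, since it only samples the
rows, so running it after each COPY shouldn't hurt.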
The files being copied into the table(s) are between 200 and 1,000 each,
but there are hundreds of these small files every day. Would you
recommend running ANALYZE after every COPY?
(3) Currently the pipeline executes in serial fashion. We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies
of a script? Our initial attempt at doing this failed. I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.
As long as you're not using the same database session in parallel you
should be fine; you can't do parallel database operations within a
single session. Last time I did something similar I used separate
database connections.
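Concretely, each loading script would open its own connection and run
its own COPY against the same table, roughly like this (file names are
made-up placeholders):

    -- session 1 (connection opened by loading script #1):
    COPY observations FROM '/data/batch_0001.dat';

    -- session 2 (connection opened by loading script #2):
    COPY observations FROM '/data/batch_0002.dat';

The COPY commands can run at the same time because each one runs in its
own session and its own transaction.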
Yes, they are separate database connections because the connection
is made in the data loading script.
(5) If I drop the indexes and foreign key constraints, then is it
possible to COPY to a table from more than one script, i.e., do
parallel COPY? It seems like a really bad idea to drop those
foreign key constraints.
You can COPY in parallel, but having no FK's does help insert
performance. In that case, whether you should or shouldn't remove your
FK's depends on what's more important to you: insert performance or
data correctness.
As some of your users query the data while it's still coming in, I
guess that data correctness is in fact more important to you, and you
should keep the FK's.
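For completeness, if you ever did decide to trade the correctness
checks for load speed during a bulk load, the usual pattern is roughly
the following (table, column, and constraint names are made-up
placeholders):

    ALTER TABLE observations DROP CONSTRAINT observations_image_id_fkey;
    -- ... run the bulk COPY commands here ...
    ALTER TABLE observations
        ADD CONSTRAINT observations_image_id_fkey
        FOREIGN KEY (image_id) REFERENCES images (id);

Re-adding the constraint re-checks every existing row, and the data is
unchecked while the constraint is gone, so this only really pays off
for large loads.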
Yes, I think we should keep the FK's.
You wrote that your pipeline runs for a period of 4 hours and the
table is about 15M rows now. What is taking up all that time? I
understand why you'd want to parallelise that process, but where do
you expect the gains?
We're processing images, and the data processing and analysis take up
most of the time, but the images can be processed/analyzed in parallel.
We've been doing all of the data loading at the end - one COPY at a
time. Originally that made sense because the researchers wanted to
check the images before loading the data/analysis results into the db.
But now we want to load the data/analysis results as soon as they are
available. So if there are four copies of the processing/analysis
script running, we want the data loading script to be kicked off at the
end of each one, which amounts to four COPY commands writing four
different files into the same table.
In terms of time, we will not see a huge decrease in the wall clock time,
but the data/analysis results will be available to the users much sooner,
and this matters a lot to the researchers.
From the above I'm guessing that part of the problem you want to solve
by parallelising is insert performance. In cases like these I've seen
people run into problems with the look-up speed of the foreign key
checks because the database chose a bad query plan. Running ANALYZE
regularly while the data is being inserted should prevent that.
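If you want to check whether the statistics are keeping up while you
load, one thing you can look at (table name is a made-up placeholder)
is:

    SELECT relname, last_analyze, last_autoanalyze
      FROM pg_stat_user_tables
     WHERE relname = 'observations';

If those timestamps lag far behind the load, an explicit ANALYZE in the
loading script is the easy fix.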
Okay, so I can try this.
Should we re-think where our database loading fits into the
overall pipeline, i.e., do the data processing and analysis in
parallel, but keep the data loading sequential? The reason for
not doing all of the data loading at the end is that some of the
users *really* want to start querying the data and analysis
results as soon as they enter the database.
You seem to have a design where the database grows indefinitely? Your
database doesn't look to be very big yet, but I expect that at some
point in the future your data will grow so much that the ordering of
it on disk starts to matter for how fast records can be fetched.
That's a typical scenario where people here start to advise using
table partitioning.
The thing is, though, that partitioning only works well if the queries
your users run contain clauses the database can use to rule out
partitions (the same clauses used to partition the table).
The people you work with are apparently knowledgeable enough to write
their own queries, so they will also have to take the partition
constraints into account if you choose to use table partitioning.
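As an illustration, the kind of setup usually meant here is
inheritance-based partitioning with a CHECK constraint per child table.
All names and the date column below are made-up placeholders:

    -- one child table per month; rows are loaded into the children
    CREATE TABLE observations_2009_08 (
        CHECK (obs_date >= DATE '2009-08-01'
           AND obs_date <  DATE '2009-09-01')
    ) INHERITS (observations);

    -- queries only skip partitions when they repeat the partitioning
    -- clause (and constraint_exclusion is enabled):
    SELECT count(*)
      FROM observations
     WHERE obs_date >= DATE '2009-08-01'
       AND obs_date <  DATE '2009-09-01';

If a researcher leaves out the obs_date clause, the planner has to scan
every partition, which is why the query-writing habits matter.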
Yes, there are a couple of tables that it seems will grow indefinitely.
We are not in production yet, so there is some question about whether
so much should be going into the database. For now the answer is
"yes, everything".
A week or so ago I posted a question about table partitioning, and I
got some good suggestions having to do with partitioning and partial
indexes, so I think both are in our future.
Yes, the researchers do write their own queries, so I think with
adequate explanation, they could take advantage of table partitions.
I need to think a lot about what the right partitions are, however.
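On the partial-index side, what I have in mind is something along these
lines (the column and the predicate are made-up placeholders for
whatever the researchers actually filter on):

    CREATE INDEX observations_bright_idx
        ON observations (obs_date)
     WHERE magnitude < 20;

i.e., an index that only covers the slice of rows a common class of
queries touches, so it stays small even as the table keeps growing.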
Thanks,
Janet
Regards,
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.