questions on (parallel) COPY and when to REINDEX

Janet Jacobsen <jsjacobsen@xxxxxxx> · Sat, 01 Aug 2009 14:24:41 -0700

Hi.  We are running a data processing/analysis pipeline that
writes about 100K records to two tables on a daily basis.
The pipeline runs from about 6:00 a.m. to 10:00 a.m.

Our user base is small - about five people.  Each accesses
the database in a different way (generally using some script
- either Perl or Python).

Some people begin querying the database as soon as the new
data/analysis results start being loaded.  Others wait until the
day's run is complete, so the number of concurrent users is
small at this time.

The data/analysis results are loaded into two tables from two
files of 200 to 1,000 rows each using the COPY command,
which is executed from a Perl script that uses DBD-Pg.

Other details: Postgres 8.3.7 running on a Linux system
with eight processors.

Both of the big tables (now up to > 15 M rows each) have
indexes on several of the columns.  The indexes were
created using CREATE INDEX CONCURRENTLY...
Both tables have one or two foreign key constraints.

My questions are:
(1) At the point that the data are being loaded into the tables,
are the new data indexed?
(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?
(3) Currently the pipeline executes in serial fashion.  We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies
of a script?  Our initial attempt at doing this failed.  I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.
(4) Does COPY lock the table?  Do I need to explicitly
LOCK the table before the COPY command?  Does LOCK
even apply to using COPY?  If I used table locking, would
parallel COPY work?
(5) If I drop the indexes and foreign key constraints, then is it
possible to COPY to a table from more than one script, i.e., do
parallel COPY?  It seems like a really bad idea to drop those
foreign key constraints.

Should re-think about where our database loading fits into the
overall pipeline, i.e., do the data processing and analysis in
parallel, but keep the data loading sequential?  The reason for
not doing all of the data loading at the end is that some of the
users *really* want to start querying the data and analysis
results as soon as they enter the database.

Looking forward to your replies.

Janet

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general