Hi. We are running a data processing/analysis pipeline that writes about 100K records to two tables on a daily basis. The pipeline runs from about 6:00 a.m. to 10:00 a.m. Our user base is small - about five people. Each accesses the database in a different way (generally using some script - either Perl or Python). Some people begin querying the database as soon as the new data/analysis results start being loaded. Others wait until the day's run is complete, so the number of concurrent users is small at this time. The data/analysis results are loaded into two tables from two files of 200 to 1,000 rows each using the COPY command, which is executed from a Perl script that uses DBD-Pg. Other details: Postgres 8.3.7 running on a Linux system with eight processors. Both of the big tables (now up to > 15 M rows each) have indexes on several of the columns. The indexes were created using CREATE INDEX CONCURRENTLY... Both tables have one or two foreign key constraints. My questions are: (1) At the point that the data are being loaded into the tables, are the new data indexed? (2) Should I REINDEX these two tables daily after the pipeline completes? Is this what other people do in practice? (3) Currently the pipeline executes in serial fashion. We'd like to cut the wall clock time down as much as possible. The data processing and data analysis can be done in parallel, but can the loading of the database be done in parallel, i.e., can I execute four parallel COPY commands from four copies of a script? Our initial attempt at doing this failed. I found one posting in the archives about parallel COPY, but it doesn't seem to be quite on point. (4) Does COPY lock the table? Do I need to explicitly LOCK the table before the COPY command? Does LOCK even apply to using COPY? If I used table locking, would parallel COPY work? (5) If I drop the indexes and foreign key constraints, then is it possible to COPY to a table from more than one script, i.e., do parallel COPY? It seems like a really bad idea to drop those foreign key constraints. Should re-think about where our database loading fits into the overall pipeline, i.e., do the data processing and analysis in parallel, but keep the data loading sequential? The reason for not doing all of the data loading at the end is that some of the users *really* want to start querying the data and analysis results as soon as they enter the database. Looking forward to your replies. Janet -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general