Janet Jacobsen wrote:
Hi. We are running a data processing/analysis pipeline that writes about 100K records to two tables on a daily basis. The pipeline runs from about 6:00 a.m. to 10:00 a.m. Our user base is small - about five people. Each accesses the database in a different way (generally using some script - either Perl or Python). Some people begin querying the database as soon as the new data/analysis results start being loaded. Others wait until the day's run is complete, so the number of concurrent users is small at this time. The data/analysis results are loaded into two tables from two files of 200 to 1,000 rows each using the COPY command, which is executed from a Perl script that uses DBD-Pg. Other details: Postgres 8.3.7 running on a Linux system with eight processors. Both of the big tables (now up to > 15 M rows each) have indexes on several of the columns. The indexes were created using CREATE INDEX CONCURRENTLY... Both tables have one or two foreign key constraints. My questions are: (1) At the point that the data are being loaded into the tables, are the new data indexed?
It depends on whether an index exists on the table when you fill it with data. If there is an index, it is updated as the rows go in; COPY maintains indexes just like INSERT does.
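To make it concrete, here's a minimal sketch (table, column, and file names are all made up) showing that rows loaded by COPY are reachable through an existing index right away:

  CREATE TABLE results (id integer, score numeric);
  CREATE INDEX results_score_idx ON results (score);

  -- the index is maintained during the load:
  COPY results FROM '/tmp/results.dat';

  -- the planner can use the index immediately; no REINDEX needed:
  EXPLAIN SELECT * FROM results WHERE score > 0.9;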
(2) Should I REINDEX these two tables daily after the pipeline completes? Is this what other people do in practice?
It depends on whether an index exists on the table when you fill it with data. But I repeat myself :-). If an index exists, you do not need to REINDEX it; Postgres keeps it current as rows are added. For a bulk load it can be faster to fill the table without an index and then build the index afterward, but that depends on whether you need the index during the load, for example to enforce a unique constraint.
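If you did want to try the load-then-index route, the pattern would look something like this (hypothetical names again):

  DROP INDEX results_score_idx;
  COPY results FROM '/tmp/results.dat';
  -- rebuild afterward; CONCURRENTLY avoids blocking your early readers:
  CREATE INDEX CONCURRENTLY results_score_idx ON results (score);

Whether that's a win depends on the proportions: appending ~100K rows a day to a 15M-row table, maintaining the indexes during the COPY is probably cheaper than rebuilding them from scratch.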
(3) Currently the pipeline executes in serial fashion. We'd like to cut the wall clock time down as much as possible. The data processing and data analysis can be done in parallel, but can the loading of the database be done in parallel too, i.e., can I execute four parallel COPY commands from four copies of the load script?
We'd need more specifics. Are you COPYing into two different tables at once? That should work. The same table, with different data in each COPY? That should work too, I'd guess. Or the same data into a table with a unique key? That'll break.
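The second case, two sessions loading different files into the same table, would look like this (hypothetical names):

  -- session 1:
  COPY results FROM '/data/results_part1.dat';

  -- session 2, running at the same time:
  COPY results FROM '/data/results_part2.dat';

COPY takes a lock mode that doesn't conflict with another COPY on the same table, so both loads can proceed at once.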
Our initial attempt at doing this failed.
What was the error?
I found one posting in the archives about parallel COPY, but it doesn't seem to be quite on point.
Parallel copy has been added to pg_restore, but I think that restores different tables in parallel, not the same table. Was that the posting you saw?
(4) Does COPY lock the table? Do I need to explicitly LOCK the table before the COPY command? Does LOCK even apply to using COPY? If I used table locking, would parallel COPY work?
You do not need to lock the table yourself. COPY takes only a lightweight lock that does not block other readers or writers, and explicit locking is counterproductive to multiuser access. Why would you think locking a table would let parallel COPY work? A LOCK gives one process exclusive access to a table, which is exactly what you don't want here.
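If anything, an explicit LOCK would serialize the very loads you're trying to parallelize. A sketch of what would happen (hypothetical names):

  -- session 1:
  BEGIN;
  LOCK TABLE results IN ACCESS EXCLUSIVE MODE;
  COPY results FROM '/data/results_part1.dat';

  -- session 2: this COPY now blocks...
  COPY results FROM '/data/results_part2.dat';

  -- session 1:
  COMMIT;  -- ...and only now does session 2's COPY begin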
(5) If I drop the indexes and foreign key constraints, then is it possible to COPY to a table from more than one script, i.e., do parallel COPY? It seems like a really bad idea to drop those foreign key constraints.
Yes, it would be a bad idea. One thing that could stop you is a unique constraint with two COPYs inserting the same data. What sort of errors did you get the last time you tried this? I have never tried two processes COPYing into the same table at the same time, but I'd bet it's possible.

-Andy
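P.S. Here's roughly what I'd expect if two COPYs do carry overlapping keys into a table with a unique constraint (names made up):

  CREATE TABLE events (id integer PRIMARY KEY, payload text);

  -- session 1:
  COPY events FROM '/data/events_a.dat';

  -- session 2, loading a file that repeats an id from events_a.dat:
  COPY events FROM '/data/events_b.dat';
  -- fails with something like:
  -- ERROR:  duplicate key value violates unique constraint "events_pkey"

And note the failed COPY rolls back entirely; you don't just lose the duplicate row.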