Re: questions on (parallel) COPY and when to REINDEX

andy <andy@xxxxxxxxxxxxxxx> · Sun, 02 Aug 2009 19:36:45 -0500

Janet Jacobsen wrote:
Thanks for your reply.  Responses below to answer your questions,
and one follow-up question on REINDEX.

Janet

Both tables have one or two foreign key constraints.

(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?

it depends if an index exists on the table when you fill it with 
data.  But I repeat myself :-).  If an index exists you would not need 
to reindex it.  It may be faster to fill a table without an index, 
then add an index later.  But that would depend on if you need the 
index for unique constraints.

Ok.  Since data loading occurs daily, and the users query the table
while the data loading is going on, it seems like I should not drop
the indexes before the daily loading. 

I re-read the REINDEX pages.  I see the following statement,

    "Also, for B-tree indexes a freshly-constructed index is somewhat
    faster to access than one that has been updated many times, because
    logically adjacent pages are usually also physically adjacent in a
    newly built index. (This consideration does not currently apply to
    non-B-tree indexes.) It might be worthwhile to reindex periodically
    just to improve access speed."

This quote says "has been updated many times" and "worthwhile to index
periodically".  I'm not sure how to interpret "many times" and 
"periodically".

In our case, on a daily basis, 100K rows or so are added to two tables,
and a later script does 100K updates on one of the table. Does that make
us a candidate for daily REINDEXing?

Its tough to say.  I'd guess not every day.  Once a month?  The best way to find out is to do some timing.  Do a few indexed select statements and 'explain analyze' them.  See what the numbers tell you.  Then REINDEX and do the same test.  Then wait a month and try the same test.  See if its much slower.

The difference between having an index and not is hugely huge orders of magnitude.  The difference between a balanced index and unbalanced is minor.

A vacuum analyze might be more important than a reindex, depending on how many indexes you have, it will update the stats about the indexes and help the planner pick the best index.

I cant answer as to what others do, my pg database is 25meg.  Yes meg.  And 260K rows.  Its embarrassing.  By next month I'll probably be up to 260.5K rows.  So I really have no experience with the volume of data your pushing around.

(3) Currently the pipeline executes in serial fashion.  We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies

We'd need more specifics.  Are you COPY'ing into two different tables 
at once?  (that should work).  Or the same table with different data 
(that should work too, I'd guess) or the same data with a unique key 
(that'll break)?

We'd like to run four identical scripts in parallel, each of which will
copy a different file into the same table.
Our initial attempt at doing this failed.  

What was the error?

If the return status from trying to do the COPY is 7, the script prints
a failure message, i.e., not the Postgres error. I will set up a test on
a test case to get more information.  (I didn't run the initial try.)

COPY wont return 7.  Not sure where the 7 comes from.  The copy may fail and whatever language your programming in my raise an exception, which might be numbered 7... I suppose.  Multiple copy's into the same table at the same time should work.  I think the error was on your side.

Note that while you are in a transaction your clients wont be able to see any of the data until you commit.  Since some of them want at the data asap, you might want to break up your copy's with a few commits.  I sur'pose tha'd depend on how long it all takes though.

I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.

They have added parallel copy to the pg_restore, but I think that does 
different tables, not the same table.  Was that what you saw?

Yes, I think so.  The reference is to "Andrews parallel restore patch":
http://archives.postgresql.org/pgsql-hackers/2008-09/msg01711.php
The subject line is "lock contention on parallel COPY ?"

Yeah, that's an internal lock on some really small variable deep in the guts of pg core.  Not an entire table lock.

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general