> > We're using a statement like this to dump between 500K and >5 million
> > rows.
> >
> > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > TO '/dev/shm/SomeFile.csv'
> >
> > Upon first run, this operation can take several minutes. Upon second
> > run, it will generally complete in well under a minute.
>
> Hmmm ... define "first" versus "second". What do you do to return it
> to the slow state?

Interesting that you ask. I haven't found a very reliable way to reproduce
this. Typically, just waiting a while before running the same query again
will reproduce the slow behavior; restarting postgresql reproduced it as
well. However, I can't find a way to flush buffers/etc. to reproduce the
problem on demand.

The "first time" vs. "second time" is usually simply defined by the temp
CSV file being created slowly once and then quickly, although frankly
sometimes it's always slow, and sometimes it seems a bit quicker, depending
on the query. As I mentioned in my other post, the query is always of the
simple form above: a single column greater than or equal to a single value
(int or single-byte char).

We had set shared_buffers to 1028M, which at first seemed to make a
significant difference. In testing, various queries ran in under 20 seconds
the "first time", and the second time at the same speed or a bit quicker.
However, operations then restarted the system (and downgraded it to Red Hat
5.0), and now we're back where we started. Even with the increased
shared_buffers, things are slow; perhaps I'll try increasing it further.

Upon first starting postgresql and executing the above statement, the
process grows to 1 GB and then stops. When it stops, the CSV file starts to
"trickle" out to disk.

Best,
H
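
P.S. A sketch of one way to force the slow "first run" state on demand,
assuming a Linux box with root access and a 2.6.16+ kernel ($PGDATA below is
a placeholder for the actual data directory). Restarting PostgreSQL only
clears shared_buffers, not the kernel page cache, which is why a restart
alone doesn't always bring the slowness back:

    # stop PostgreSQL so its shared_buffers are released
    pg_ctl stop -D "$PGDATA"
    # flush dirty pages to disk, then drop the kernel page/dentry/inode caches
    sync
    echo 3 > /proc/sys/vm/drop_caches
    # start PostgreSQL again; the next COPY should hit disk like a true first run
    pg_ctl start -D "$PGDATA"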