On 12/31/06, Nikola Milutinovic <alokin1@xxxxxxxxx> wrote:
> 1. There is no difference (speed-wise) between committing every 1K or every 250K rows.
> It was really some time ago since I experimented with this. My last experiment was on PG 7.2 or 7.3. I was inserting about 800,000 rows. Inserting without transactions took 25 hrs. Inserting with 10,000 rows per transaction took about 2.5 hrs, so the speedup was 10x. I have not experimented with the transaction batch size, but I suspect that 1,000 would not show much speedup.
I would imagine the overhead here is sending the BEGIN/COMMIT (or the behavior could be version dependent), and each commit forcing a log flush. According to the documentation, multiple inserts have the potential to be flushed in one shot when the database wakes up to do a log flush on its own, so committing more frequently would actually slow you down.
The time to commit depends on hardware; on my hardware it was around 40 ms. You have to question the precision of that figure, though: it is probably less than 40 ms, so treat it as an upper limit under load.
My experiment was with 8.2 in the default configuration, so there is room for improvement. It was installed from Devrim's RPM packages.
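For illustration, here is a minimal sketch of the batching pattern with DBD::Pg, turning AutoCommit off so a whole batch of rows shares one commit (and therefore one log flush). The connection string, table name (test_table), columns, and batch size are just placeholders, not anything from the tests above:

#!/usr/bin/perl
# Sketch only: batch many inserts per transaction so they share one commit.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=testdb', 'user', 'pass',
                       { AutoCommit => 0, RaiseError => 1 });

my $sth = $dbh->prepare('INSERT INTO test_table (id, payload) VALUES (?, ?)');

my $batch_size = 10_000;            # rows per transaction
my $count      = 0;

for my $id (1 .. 800_000) {
    $sth->execute($id, "row $id");
    if (++$count % $batch_size == 0) {
        $dbh->commit;               # one log flush for the whole batch
    }
}
$dbh->commit;                       # flush the final partial batch
$dbh->disconnect;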
> 2. Vacuuming also makes no difference for a heavy insert-only table, only slows it down.
> Makes sense. Since my application was dumping all records each month and inserting new ones, vacuum was really needed, but it gave no speedup.
I agree, vacuuming is clearly important. I would also think that if you are going to do a massive one-time update/delete as a maintenance item on a normally read-only table, you should plan on a VACUUM FULL to recover the space used by the "old" rows. Logically, the fewer pages on disk, the less I/O you will see in scenarios where you are doing sequential scans, and probably even many index scans.
It also seems that the MVCC implementation would introduce fragmentation (with respect to a btree-indexed field): if your table has an indexed creation-date field that you often range scan on, but you also update those rows, then over time the optimizer will favor that index less and less as the correlation approaches 0. Obviously this is a great feature for a "last update date" field. :) Not so great if your primary queries are on a creation-date field.
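As a rough sketch of both points, assuming a hypothetical table "events" with an indexed created_on column: reclaim the dead-row space after the bulk update/delete, then watch the correlation statistic the planner uses (pg_stats.correlation). AutoCommit stays on because VACUUM cannot run inside a transaction:

#!/usr/bin/perl
# Sketch only: VACUUM FULL after a bulk update/delete, then check correlation.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=testdb', 'user', 'pass',
                       { AutoCommit => 1, RaiseError => 1 });

$dbh->do('VACUUM FULL ANALYZE events');   # fewer pages on disk, fresh stats

# correlation near 1.0 means the indexed column still matches the physical
# row order; near 0 means updates have scattered the rows.
my ($corr) = $dbh->selectrow_array(
    q{SELECT correlation FROM pg_stats
       WHERE schemaname = 'public'
         AND tablename  = 'events'
         AND attname    = 'created_on'});

printf "created_on correlation: %.3f\n", $corr;
$dbh->disconnect;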
> 3. Table size plays no real factor.
> The reason I saw a speedup must have to do with the fact that without transactions, each insert was its own transaction. That was eating resources.
I would agree. I am also reusing the same statement handles (prepare once, execute many) with DBD::Pg. The benefit appears to be that the statement is parsed and planned once, and the same plan is then executed many times. The difference was about 2000 inserts/s!
This is one of the reasons everyone keeps saying to use COPY instead of INSERT: COPY is essentially a one-time prepare with many executes.
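A minimal sketch of the COPY approach from DBD::Pg might look like the following. Note that pg_putcopydata/pg_putcopyend require a reasonably recent DBD::Pg (older versions expose COPY through $dbh->func instead), and the table name and columns are again invented for the example:

#!/usr/bin/perl
# Sketch only: bulk loading with COPY instead of individual INSERTs.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=testdb', 'user', 'pass',
                       { AutoCommit => 0, RaiseError => 1 });

$dbh->do('COPY test_table (id, payload) FROM STDIN');

for my $id (1 .. 800_000) {
    # COPY text format: tab-separated columns, one row per line
    $dbh->pg_putcopydata("$id\trow $id\n");
}

$dbh->pg_putcopyend;    # finish the COPY stream
$dbh->commit;
$dbh->disconnect;

The timings below compare the two INSERT variants (prepared once vs. prepared per row), not COPY.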
Test #1 (prepare once, execute many):
4000 inserts 0.92 secs, 4368.84 inserts/s, commit 0.04 secs.
4000 inserts 0.93 secs, 4303.47 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4319.78 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4306.38 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4299.53 inserts/s, commit 0.02 secs.
4000 inserts 0.92 secs, 4345.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4298.67 inserts/s, commit 0.03 secs.
4000 inserts 0.91 secs, 4382.13 inserts/s, commit 0.04 secs.
4000 inserts 0.92 secs, 4347.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4314.66 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 28.7435
inserts/s (Mean) = 4328.6351
Test #2 (prepare/execute each time):
4000 inserts 1.92 secs, 2086.21 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2092.67 inserts/s, commit 0.02 secs.
4000 inserts 1.91 secs, 2094.54 inserts/s, commit 0.05 secs.
4000 inserts 1.96 secs, 2042.55 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.57 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2098.91 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.38 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2083.94 inserts/s, commit 0.02 secs.
4000 inserts 1.95 secs, 2050.07 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2086.14 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 19.2360
inserts/s (Mean) = 2083.1987