2010/1/5 Milan Zamazal <pdm@xxxxxxxxxxxx>
My problem is that retrieving sorted data from large tables is sometimes
very slow in PostgreSQL (8.4.1, FWIW).
I typically retrieve the data using cursors, to display it in a UI:
BEGIN;
DECLARE ... SELECT ... ORDER BY ...;
FETCH ...;
...
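(For concreteness, such a session might look like the following sketch; the table, column and cursor names are made-up placeholders, not my real schema.)

BEGIN;
DECLARE big_cur CURSOR FOR
    SELECT * FROM big_table ORDER BY id;   -- id has a btree index
FETCH 100 FROM big_cur;                    -- first page shown in the UI
-- further FETCHes follow as the user pages through the results
CLOSE big_cur;
COMMIT;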
On a newly created table of about 10 million rows the FETCH command
takes about one minute by default, with an additional delay during the
eventual COMMIT that follows. This is because PostgreSQL uses a
sequential scan on the table even when there is an index on the ORDER BY
column. When I force PostgreSQL to perform an index scan (e.g. by
setting one of the options enable_seqscan or enable_sort to off), the
FETCH response is immediate.
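(What I mean by forcing it, as a sketch with the same placeholder names; enable_seqscan and enable_sort are the documented planner options, the rest is illustrative:)

SET enable_seqscan = off;   -- or: SET enable_sort = off;
BEGIN;
DECLARE big_cur CURSOR FOR SELECT * FROM big_table ORDER BY id;
FETCH 100 FROM big_cur;     -- now served from the index, first rows return at once
CLOSE big_cur;
COMMIT;
RESET enable_seqscan;       -- restore the default afterwards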
The PostgreSQL manual explains the motivation for sequential scans of
large tables, and I can understand it. Nevertheless, such behavior leads
to unacceptably poor performance in my particular case. It is important
to get the first resulting rows quickly, to display them to the user
without delay.
My questions are:
- What is your experience with using ORDER BY + indexes on large tables?
Without a WHERE clause postgres will almost always choose a sequential scan.
- Is there a way to convince PostgreSQL to use index scans automatically
  in cases where they are much more efficient? I tried using ANALYZE,
  VACUUM and SET STATISTICS, but without success.
By using cursors you take some responsibility away from the planner.
It has no idea that you want the first 100 rows quickly; it just tries
to optimize the whole operation.
- Is it a good idea to set enable_seqscan or enable_sort to "off"
globally in my case? Or to set them to "off" just before working with
large tables? My databases contain short and long tables, often
connected through REFERENCES or joined into views, and many of the
shorter tables serve as codebooks. Can setting one of these parameters to "off"
have clearly negative impacts?
IMHO, no, no and yes.
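(To spell out the scoping the question asks about, purely as mechanics and not a recommendation: a global change lives in postgresql.conf, a session change uses SET, and SET LOCAL confines the override to one transaction, so queries against the short codebook tables are never affected. Placeholder names again:)

BEGIN;
SET LOCAL enable_seqscan = off;   -- override valid only inside this transaction
DECLARE big_cur CURSOR FOR SELECT * FROM big_table ORDER BY id;
FETCH 100 FROM big_cur;
CLOSE big_cur;
COMMIT;                           -- the override disappears with the COMMIT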
- Is there a recommended way to keep indexes in good shape so that the
performance of initial row retrievals remains good? The large tables
are typically append-only tables with a SERIAL primary key.
Use partitioning.
If that's not possible, REINDEX periodically to avoid sub-optimal btree layout. But that's just a half-solution.
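(A sketch of both ideas; the index and partition names are made up, and on 8.4 "partitioning" means the inheritance + CHECK constraint scheme described in the manual:)

-- periodic REINDEX of the index behind the ORDER BY (blocks writes while it runs, so do it off-peak)
REINDEX INDEX big_table_id_idx;

-- 8.4-style partition: a child table with a CHECK constraint on the serial key
CREATE TABLE big_table_p10 (
    CHECK (id >= 10000000 AND id < 20000000)
) INHERITS (big_table);
CREATE INDEX big_table_p10_id_idx ON big_table_p10 (id);
-- with constraint_exclusion = partition, the planner skips partitions
-- whose CHECK constraint excludes them from the query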
Thanks for any tips.
tips:
1. Get rid of cursors, unless you have a strong need for them (e.g. seeking back and forth and updating).
2. Switch to "chunked" processing, like this:
SELECT * FROM bigtable ORDER BY idxcol LIMIT 1000;
(process the records)
SELECT * FROM bigtable WHERE idxcol > [last idxcol from previous fetch] ORDER BY idxcol LIMIT 1000;
... and so on.
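(To make tip 2 concrete: the follow-up query can be a prepared statement that takes the boundary value as a parameter. Assuming idxcol is unique, e.g. the SERIAL primary key mentioned above, "greater than the last value seen" never skips or repeats rows. The boundary value below is only an example:)

PREPARE next_chunk(bigint) AS
    SELECT * FROM bigtable
    WHERE idxcol > $1              -- $1 = largest idxcol from the previous chunk
    ORDER BY idxcol
    LIMIT 1000;

EXECUTE next_chunk(1000);          -- e.g. when the previous chunk ended at idxcol = 1000
-- each call is a narrow index range scan, so the first rows come back immediately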
regards,
Filip
--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@xxxxxxxxx
http://filip.rembialkowski.net/