Re: Cluster using tablespaces?

Alvaro Herrera <alvherre@xxxxxxxxxxxxxx> · Sun, 2 Dec 2007 10:56:27 -0300

Rainer Bauer wrote:
> Alvaro Herrera wrote:

> >It has been theorized that cluster would be faster in general if instead
> >of doing an indexscan we would instead use a seqscan + sort step.  It
> >would be good to measure it.
> 
> Could a reindex on the clustered index speed up the clustering (when executed
> immediatelly before the cluster command)? As I understand it, this index is
> used to fetch the table data in the correct order. Or is most of the time
> spend fetching the table data?

I haven't measured it, but my guess is that most of the time is in
fetching heap pages in random order.

> Also, would it make sense to increase <shared_buffers> for the cluster
> operation. This is set to 32MB here on my Windows box as was recommended.

Not sure.  In general yes, but on Windows things are different.

> >> >For btree indexes, there is a temporary copy of the index data, which
> >> >will go wherever you have arranged for temp files to go.  (I think that
> >> >easy user control of this may be new for 8.3, though.)
> >> 
> >> Could you give me a hint where that would be on Windows? I guess this might be
> >> worth a try since there are a couple of btree indexes in the database.
> >
> >I think Tom is referring to the new temp_tablespaces config variable.
> 
> I moved the pgsql_tmp directory to another disk, but that didn't speed up the
> cluster command.

Probably most of the time is going into creating the new table then.

If you are looking for a short-term solution to your problem, maybe the
best is to follow the recommendation on CLUSTER ref page:

	There is another way to cluster data. The CLUSTER command
	reorders the original table by scanning it using the index you
	specify. This can be slow on large tables because the rows are
	fetched from the table in index order, and if the table is
	disordered, the entries are on random pages, so there is one
	disk page retrieved for every row moved. (PostgreSQL has a
	cache, but the majority of a big table will not fit in the
	cache.) The other way to cluster a table is to use

	CREATE TABLE newtable AS
	    SELECT * FROM table ORDER BY columnlist;

	which uses the PostgreSQL sorting code to produce the desired
	order; this is usually much faster than an index scan for
	disordered data. Then you drop the old table, use ALTER TABLE
	... RENAME to rename newtable to the old name, and recreate the
	table's indexes. The big disadvantage of this approach is that
	it does not preserve OIDs, constraints, foreign key
	relationships, granted privileges, and other ancillary
	properties of the table — all such items must be manually
	recreated. Another disadvantage is that this way requires a sort
	temporary file about the same size as the table itself, so peak
	disk usage is about three times the table size instead of twice
	the table size. 

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend