
Re: pg_dump order of rows

Not at all. He wanted a way to influence the on-disk order of the rows in a table, so he could make sure they were not left in the order of insertion. That should be possible with CLUSTER. From the documentation: "When a table is clustered, it is physically reordered based on the index information" (again, see http://www.postgresql.org/docs/9.0/static/sql-cluster.html for full details).
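
For example (the table and index names here are made up for illustration, not taken from the thread), the basic invocation looks like this:

    -- Hypothetical names: physically rewrite "orders" in the order
    -- given by the existing index "orders_customer_name_idx".
    CLUSTER orders USING orders_customer_name_idx;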

Now, if one were to give CLUSTER an index that puts the data back in insertion order, then it obviously wouldn't be very useful. On the other hand, if such an index exists at all, the insertion order of the rows can easily be reconstructed no matter how they are reordered on disk, which makes the whole exercise pointless in that case anyway.

So, which index to choose? He can either cluster on an existing index, if that would remove all trace of the insertion order (any field with unique values unrelated to the insertion order would work equally well), or he can add a field with random values, index that field, and cluster on that index. CLUSTER is just the tool... Unless I've missed something fundamental about the CLUSTER command, it should be perfectly possible to reorder the data on disk any way you want with it, as long as you first create an appropriate index to cluster on.
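
A rough sketch of the second ("random column") variant, again with made-up names:

    -- Hypothetical names; sketch only. Add a column of random values,
    -- index it and cluster on that index, then drop the helper column.
    ALTER TABLE orders ADD COLUMN shuffle_key double precision;
    UPDATE orders SET shuffle_key = random();

    CREATE INDEX orders_shuffle_idx ON orders (shuffle_key);
    CLUSTER orders USING orders_shuffle_idx;

    -- The physical order is kept after the column is dropped (it only
    -- degrades again as rows are inserted or updated); dropping the
    -- column also drops the index that depends on it.
    ALTER TABLE orders DROP COLUMN shuffle_key;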

The downsides I see are (1 and 2 taken directly from the documentation):
1. When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it.
2. You need free space on disk at least equal to the sum of the table size and the index sizes.
3. Rows inserted after CLUSTER finishes will not be "protected". You would need to run it before a dump and make sure nothing is written to the table between the clustering and the dumping, if you want to be really sure (a sketch of that step follows below).
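
As the pre-dump step, and assuming the tables have already been clustered once (so a clustering index is recorded for them), a bare CLUSTER re-clusters every previously clustered table the current user owns:

    -- Run right before pg_dump: re-cluster all previously clustered
    -- tables in the current database that the calling user owns.
    CLUSTER VERBOSE;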

Any reason why this wouldn't work?

On 2010-12-13 14:53, Vick Khera wrote:
On Mon, Dec 13, 2010 at 4:26 AM, Hampus Wessman
<hampus.wessman@xxxxxxxxx>  wrote:
CLUSTER should be able to do that for you. See
http://www.postgresql.org/docs/9.0/static/sql-cluster.html.

That's the exact opposite of what he wished to accomplish.

The only thing I can suggest is to do random order of updates, perhaps
while also setting the fill factor for the pages to something large so
there's no room to re-use a page, causing more shuffling.

I suspect you'll have to do this for every table you've got, else you
could leak some information about what rows were updated since the
last import if the person has access to the file used to import the
data originally.  You'll also have to do this every time you dump the
DB, I'd suspect...
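
For reference, here is a rough sketch of the update-shuffling approach quoted above, with hypothetical table and column names; it assumes an "id" primary key and is not from the original thread:

    -- Keep pages packed full so an updated row cannot stay on its page,
    -- then touch the rows in random order so they end up shuffled.
    ALTER TABLE orders SET (fillfactor = 100);

    DO $$
    DECLARE
        r record;
    BEGIN
        FOR r IN SELECT id FROM orders ORDER BY random() LOOP
            -- A no-op update still writes a new row version, which has
            -- to be placed wherever there is free space.
            UPDATE orders SET id = id WHERE id = r.id;
        END LOOP;
    END $$;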




