Markus Bertheau wrote:
2008/2/19, Richard Huxton <dev@xxxxxxxxxxxx>:
I'm loading a table with some short attributes and a large toastable attribute.
That means that for every main table heap page several toast table heap pages
are written. This happens through the buffer cache and the background writer,
so maybe the pages aren't written in the order in which they were created in
the buffer cache, but if they are, they end up on disk (assuming that the file
system is not fragmented) roughly like that:
main table heap page 1
toast table heap page 1
toast table heap page .
toast table heap page n
main table heap page 2
toast table heap page n+1
toast table heap page .
toast table heap page 2n
Well, that's assuming:
1. You're not re-using space from previously deleted/updated rows.
2. You've not got a RAID array striping writes over multiple disks
3. The underlying filesystem + buffering isn't doing anything too clever.
Significantly later a sequential scan of the table has to be made, the
toastable attribute is not needed for the operation. The caches are cold or
otherwise occupied.
OK
> If the granularity of caches that are nearer to the disk in
the cache hierarchy than the PG buffer cache is higher than the PG page size
(i.e. a cache unit is bigger than the PG page size), then every read of a main
table heap page will inescapably read some toast table heap pages into the
cache (whichever cache that may be).
I think disk blocks on your more common file-systems are 4KB by default
(ext2/3 and ntfs for example). I'm not aware of any default disk-block
sizes more than the 8KB page-size of PG. Of course, the OS may read
ahead if it sees you scanning, but it will do that on a file basis.
> If all the main table heap pages were
laid out adjecently on disk, they could be read faster and caches be polluted
less.
True enough. The key word there though is "if" - it means that PG is
trying to out-think the filesystem, OS and hardware.
It should be easy enough to test on a particular system though.
1. pg_restore a sample table with TOASTed data.
2. pg_restore the same data but no TOASTed data.
3. cluster the table with TOASTed data (which should force a rewrite of
the whole table but not its TOASTed data)
If the timing of various selects differ hugely then there's something
worth investigating.
--
Richard Huxton
Archonet Ltd
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match