
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

On Fri, 2 Oct 2009, Simon Riggs wrote:


> On Sun, 2009-09-27 at 18:05 +0200, Gerhard Wiesinger wrote:

>> So I saw that even on sequential reads (and also on bitmap heap scan
>> access) PostgreSQL uses only 8k blocks. I think that's a major I/O
>> bottleneck.
>>
>> A commercial database vendor solved the problem by reading multiple
>> contiguous 8k blocks in a single request, up to a maximum threshold.
>> Output per 5 seconds on an equivalent "sequential scan":

> Is systemtap counting actual I/Os, or just requests to access 8192-byte
> blocks that may already be in the OS cache? Postgres doesn't read more
> than one block at a time into its buffer pool, so those numbers of
> requests look about right.


As far as I know these are VFS reads, so some reads might be served from
the cache. But since I did the following before each benchmark, all
requests should come from disk:

echo 3 > /proc/sys/vm/drop_caches; service postgresql restart
do benchmark

I did the same for all benchmarks because I don't want to measure the
cache performance of the OS or of the DB.

Therefore all requests (except for blocks read twice or more, which
shouldn't happen and would also be fine as cache hits) are served from
disk and not from the cache.
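
For the record, the numbers come from counting reads at the VFS layer,
roughly like this (a minimal sketch; the filter on the "postgres" process
name and the 5 second interval are assumptions for my setup):

  stap -e '
    global nreads, nbytes
    # count every successful VFS read issued by PostgreSQL backends
    probe vfs.read.return {
      if (execname() == "postgres" && $return > 0) {
        nreads++
        nbytes += $return
      }
    }
    # print and reset the counters every 5 seconds
    probe timer.s(5) {
      printf("VFS reads: %d, bytes: %d\n", nreads, nbytes)
      nreads = 0
      nbytes = 0
    }'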

> There is belief here that multi-block I/O was introduced prior to the OS
> doing this as a standard mechanism. Linux expands its readahead window
> in response to sequential scans, and so this seems like something we
> don't want to do in the database.

I even played with large readahead values on the block devices /dev/md*,
/dev/sd* and /dev/dm-*, as well as with the stripe_cache_size of
/dev/md*, but without any performance improvement in the benchmark
scenarios.

=> None of the readaheads/caches seem to help, at least in the BITMAP
HEAP SCAN scenarios, even on nearly the latest Linux kernels.
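
For reference, these are the kinds of settings I mean (a sketch; device
names and values are examples, and blockdev readahead is given in
512-byte sectors):

  blockdev --getra /dev/md0                         # show current readahead
  blockdev --setra 16384 /dev/md0                   # set readahead to 8 MB
  echo 8192 > /sys/block/md0/md/stripe_cache_size   # RAID5/6 stripe cache (pages)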

But I think such block issues (reading in blocks as large as possible)
have to be optimized at the application level (in our case the DB level),
because:
1.) We can't assume that the OS and even the storage work well in such
scenarios.
2.) We can't assume that the OS/storage is intelligent enough to merge
two requests when 2 "random" blocks happen to lie sequentially on disk,
and that the number of IOPS is therefore reduced (the quick dd comparison
below shows how much the request size matters).
3.) I think such logic should be very easy to integrate, and it has even
been done already with some patches.
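
As a quick way to see that request size alone changes the number of IOPS
needed for the same amount of data (a sketch; the device name and sizes
are examples, it reads 1 GB each time, so only run it against a device
you can safely read from):

  # same 1 GB, first in 8k requests, then in 128k requests
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/dev/md0 of=/dev/null bs=8k count=131072
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/dev/md0 of=/dev/null bs=128k count=8192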


> It's possible this is wrong. Is the table being scanned fairly sizable,
> and was it allocated contiguously? i.e. was it a large table loaded via
> COPY?
>
> I also wonder if more L2 cache effects exist.


What do you mean by "the table being scanned is fairly sizable"? I don't
get it.

The table was filled with a lot of INSERTs, but all at one point in time.
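
To check how contiguously the table is actually laid out on disk, one
could look at the underlying relation file (a sketch; the table name,
the OIDs and the data directory path are examples for my setup):

  # find the file that backs the table
  psql -c "SELECT relfilenode FROM pg_class WHERE relname = 'mytable';"
  # the file lives under $PGDATA/base/<database oid>/<relfilenode>, e.g.:
  filefrag /var/lib/pgsql/data/base/16384/16385
  # "1 extent found" would mean the file is fully contiguous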

Ciao,
Gerhard

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
