Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

Gerhard Wiesinger <lists@xxxxxxxxxxxxx> · Fri, 2 Oct 2009 20:46:02 +0200 (CEST)

On Fri, 2 Oct 2009, Greg Smith wrote:

On Sun, 27 Sep 2009, Gerhard Wiesinger wrote:

I think this is one of the most critical performance showstopper of 
PostgreSQL on the I/O side.

I wish, this is an easy problem compared to the real important ones that need 
to be resolved.  Situations where the OS is capable of faster sequential I/O 
performance than PostgreSQL appears to deliver doing reads are often caused 
by something other than what the person doing said benchmarking believes they 
are.  For example, the last time I thought I had a smoking gun situation just 
like the one you're describing, it turns out the background operation I 
didn't know was going on that slowed things down were hint bit updates: 
http://wiki.postgresql.org/wiki/Hint_Bits

Background checkpoints can also cause this, typically if you set 
checkpoint_segments really high and watch when they're happening you can 
avoid that interfering with results too.

It's hard to isolate out the cause of issues like this.  Since most people 
seem to get something close to real disk speed from sequential scans when 
measured properly, I would suggest starting with the assumption there's 
something wrong with your test case rather than PostgreSQL.  The best way to 
do that is to construct a test case others can run that shows the same 
problem on other systems using the database itself.  The easiest way to build 
one of those is using generate_series to create some bogus test data, SELECT 
everything in there with \timing on, and then use the size of the relation on 
disk to estimate MB/s.

Regardless, it's easy enough to build PostgreSQL with larger block sizes if 
you think that really matters for your situation.  You're never going to see 
that in the mainstream version though, because there are plenty of downsides 
to using larger blocks.  And since the database doesn't actually know where 
on disk things are at, it's not really in a good position to make decisions 
about I/O scheduling anyway.  More on that below.

What's the current status of the patch of Gregory Stark? Any timeframes to 
integrate?

There needs to be a fairly major rearchitecting of how PostgreSQL handles 
incoming disk I/O for that to go anywhere else, and I don't believe that's 
expected to be ready in the near future.

Does it also work for sequence scans? Any plans for a generic "multi block 
read count" solution?

There was a similar patch for sequential scans submitted by someone else 
based on that work.  It was claimed to help performance on a Linux system 
with a rather poor disk I/O setup.  No one else was able to replicate any 
performance improvement using the patch though.  As far as I've been able to 
tell, the read-ahead logic being done by the Linux kernel and in some 
hardware is already doing this sort of optimization for you on that OS, 
whether or not your app knows enough to recognize it's sequentially scanning 
the disk it's working against.

I've enhanced the pgiosim project http://pgfoundry.org/projects/pgiosim/ 
with a patch for larger blocksizes independent from PostgreSQL: http://www.wiesinger.com/opensource/pgiosim/pgiosim-0.2-blocksizes.diff

You'll find some detailed results below and can verify this on your 
platforms with the patch above. Maybe someone can verify this on different 
HW/SW plattforms. If you have any questions regarding the pgiosim and the 
patch just feel free to ask.

Summary:
RANDOM I/O of blocksizes of e.g. 128k (e.g. BITMAP HEAP SCAN) has better 
performance than reading the same blocks with 8k block sizes (factor 1.5).

Conclusio:
In the test scenario the proposed solution would have a performance gain 
of a factor of 1.5 for typical BITMAP HEAP SCANS. For other scenarios no 
performance gain with larger block sizes of continuous blocks could be 
measured. Therefore I'm assuming that prefetching works well on Linux with 
sequential I/O but not with random I/O.

I hope I can convince someone that such optimizations make sense as 
commercial database venders have implemented such features for performance 
reasons.

BTW: Prefetch is enabled on the raid and blockdevices.

Ciao,
Gerhard

--
http://www.wiesinger.com/

# RANDOM I/O 8k blocksize
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 test.txt
Arg: 1
Added test.txt
blocksize=8192, reading block as a whole
Elapsed: 135.92
Read 10000 blocks Wrote 0 blocks
73.57 op/sec, 588.60kB/sec

# RANDOM I/O 8k blocksize (for verification only), in fact same test as below
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -r test.txt
Arg: 1
Added test.txt
blocksize=8192, doing single read requests with chunk size of 8192 bytes
Elapsed: 136.30
Read 10000 blocks Wrote 0 blocks
73.37 op/sec, 586.94kB/sec

# RANDOM I/O 128k blocksize, read as one 128k block
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -o 131072 test.txt
Arg: 1
Added test.txt
blocksize=131072, reading block as a whole
Elapsed: 160.56
Read 10000 blocks Wrote 0 blocks
62.28 op/sec, 7972.15kB/sec

# RANDOM I/O 128k blocksize, read as multiple 8k blocks sequentially
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -o 131072 -r test.txt
Arg: 1
Added test.txt
blocksize=131072, doing single read requests with chunk size of 8192 bytes
Elapsed: 245.69
Read 10000 blocks Wrote 0 blocks
651.20 op/sec, 5209.81kB/sec

# SEQUENTIAL I/O 128k blocksize, read as multiple 128k blocks
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -s -b 100000 -o 131072 
test.txt
Seq Scan
Arg: 1
Added test.txt
blocksize=131072, reading block as a whole
Elapsed: 112.78
Read 100000 blocks Wrote 0 blocks
886.70 op/sec, 113497.67kB/sec

# SEQUENTIAL I/O 128k blocksize, read as multiple 8k blocks sequentially
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -s -b 100000 -o 131072 -r 
test.txt
Seq Scan
Arg: 1
Added test.txt
blocksize=131072, doing single read requests with chunk size of 8192 bytes
Elapsed: 112.45
Read 100000 blocks Wrote 0 blocks
14228.81 op/sec, 113830.49kB/sec

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general