
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

On Fri, 2 Oct 2009, Greg Smith wrote:

On Fri, 2 Oct 2009, Gerhard Wiesinger wrote:

Larger blocksizes also reduce IOPS (I/Os per second), which might be a critical threshold on storage systems (e.g. Fibre Channel systems).

True to some extent, but don't forget that IOPS is always relative to a block size in the first place. If you're getting 200 IOPS with 8K blocks, increasing your block size to 128K will not give you 200 IOPS at that larger size; the IOPS number at the larger block size is going to drop too. And you'll pay the penalty for that drop every time you access something that would have been only an 8K bit of I/O before.


Yes, there will be some (very small) drop in IOPS when the blocksize is larger, but today's disks have plenty of throughput headroom when IOPS*128k is compared to e.g. 100MB/s. I've done some Excel calculations which support this.
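For illustration (my own rough numbers, not the figures from that spreadsheet): 200 IOPS with 8k requests moves about 200 * 8k ~ 1.6 MB/s, while 200 IOPS with 128k requests would be about 200 * 128k ~ 25 MB/s. Even if the larger requests cost some IOPS, a disk that can do ~100 MB/s sequentially is nowhere near its bandwidth limit in either case.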

The trade-off is very application dependent. The position you're advocating, preferring larger blocks, only makes sense if your workload consists mainly of larger scans. Someone who is pulling scattered records from throughout a larger table will suffer with that same change, because they'll be reading a minimum of 128K even if all they really needed was a few bytes. That penalty ripples all the way from the disk I/O upwards through the buffer cache.


I wouldn't read 128k blocks all the time. I would do the following:
When e.g. B0, B127 and B256 have to be read, I would issue normal 8k random block I/O.

When B1, B2, B3, B4, B5, B7, B8, B9, B10 are needed, I would make 2 requests with the largest possible blocksize:
1.) B1-B5: 5*8k=40k
2.) B7-B10: 4*8k=32k

In this case, since B5 and B7 are only one block apart, we could also discuss reading B1-B10 = 10*8k = 80k in one request and simply discarding B6.

That would reduce the IOPS by a factor of 4-5 in that scenario, and therefore throughput would go up (a rough sketch of such read coalescing is below).
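
To make the idea concrete, here is a minimal, self-contained C sketch of such read coalescing. This is not PostgreSQL code; the test file name, the 1MB cap and the rule of bridging at most one unneeded block are assumptions for illustration. Runs of adjacent block numbers are merged into a single pread() call instead of one call per 8k block:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ      8192
    #define MAX_REQUEST (1024 * 1024)   /* assumed cap per request, e.g. 1MB */
    #define MAX_GAP     1               /* bridge at most one unneeded block */

    static void
    read_coalesced(int fd, const unsigned *blocks, int nblocks)
    {
        char *buf = malloc(MAX_REQUEST);
        int   i = 0;

        while (i < nblocks)
        {
            unsigned start = blocks[i];
            unsigned end = start;
            int      j = i + 1;

            /* extend the run while blocks stay close and the request fits */
            while (j < nblocks &&
                   blocks[j] - end <= 1 + MAX_GAP &&
                   (blocks[j] - start + 1) * BLCKSZ <= MAX_REQUEST)
            {
                end = blocks[j];
                j++;
            }

            /* one pread() for the whole run instead of one per 8k block */
            size_t  len = (size_t) (end - start + 1) * BLCKSZ;
            ssize_t n = pread(fd, buf, len, (off_t) start * BLCKSZ);

            if (n < 0)
                perror("pread");
            else
                printf("blocks %u-%u: one %zu byte request\n", start, end, len);

            i = j;
        }
        free(buf);
    }

    int
    main(void)
    {
        /*
         * The example above: B1-B5 and B7-B10.  With MAX_GAP=1 this becomes
         * a single 80k request for B1-B10, discarding B6.
         */
        unsigned blocks[] = {1, 2, 3, 4, 5, 7, 8, 9, 10};
        int      fd = open("testfile", O_RDONLY);   /* hypothetical test file */

        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        read_coalesced(fd, blocks, (int) (sizeof(blocks) / sizeof(blocks[0])));
        close(fd);
        return 0;
    }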

It's easy to generate a synthetic benchmark workload that models some real-world applications and see performance plunge with a larger block size. There certainly are others where a larger block would work better. Testing either way is complicated by the way RAID devices usually have their own stripe sizes to consider on top of the database block size.


Yes, there are block-device read-ahead buffers and also RAID stripe caches, but neither seems to help in the tested BITMAP HEAP SCAN scenario, nor in practical PostgreSQL performance measurements.

But the modelled pgiosim workload isn't a synthetic benchmark; it is the same access pattern as a real-world BITMAP HEAP SCAN in PostgreSQL, where some of the blocks to be read are consecutive, at least logically in the filesystem (and with some probability also physically on disk), but currently each 8k block is read with its own request even when 2 or more blocks could be fetched in one request.

BTW: I would also cap such requests at some upper limit (e.g. 1MB).

Ciao,
Gerhard

