Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

Greg Smith <gsmith@xxxxxxxxxxxxx> · Fri, 9 Oct 2009 07:02:13 -0400 (EDT)

On Sat, 3 Oct 2009, Gerhard Wiesinger wrote:

> I wouldn't read 128k blocks all the time. I would do the following:
> When e.g. B0, B127, B256 should be read I would read in 8k random block I/O.
>
> When B1, B2, B3, B4, B5, B7, B8, B9, B10 are needed I would make 2 requests
> with the largest possible blocksize:
> 1.) B1-B5: 5*8k=40k
> 2.) B7-B10: 4*8k=32k
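
To make the proposal above concrete, here is a minimal Python sketch of the
grouping step only: it turns a set of wanted block numbers into runs of
consecutive blocks, one read request per run.  The name coalesce_blocks and
the (first_block, block_count) representation are made up for illustration;
nothing like this exists in the PostgreSQL source, and the sketch issues no
I/O at all.

```python
BLOCK_SIZE = 8192  # PostgreSQL's default block size (8k)


def coalesce_blocks(block_numbers):
    """Group block numbers into runs of consecutive blocks.

    Returns a list of (first_block, block_count) tuples, one per read
    request.  Hypothetical helper, for illustration only.
    """
    runs = []
    for block in sorted(set(block_numbers)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # Block directly follows the current run: extend that run.
            first, count = runs[-1]
            runs[-1] = (first, count + 1)
        else:
            # Gap (or first block): start a new run.
            runs.append((block, 1))
    return runs


def request_sizes(runs):
    """Size in bytes of the single read each run would need."""
    return [count * BLOCK_SIZE for _, count in runs]
```

For B1-B5 and B7-B10 this gives two runs, (1, 5) and (7, 4), i.e. one 40k
and one 32k request; for B0, B127, B256 it gives three single-block runs.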

I see what you mean now.  This is impossible in the current buffer manager 
implementation because blocks are requested one at a time, and there are 
few situations where you can predict which are going to be wanted next. 
The hash index and sequential scan are two that were possible to predict 
in that way.

The fadvise patches already committed didn't change the way blocks were
read in; they just used knowledge about what was coming next to advise the
OS.  That's quite a bit different from actually asking for things in
larger chunks and only using what you need.
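
The difference between the two approaches can be sketched in Python.  This
is an illustration under assumptions, not PostgreSQL code: all function
names are hypothetical, the advice is assumed to be POSIX_FADV_WILLNEED,
and os.posix_fadvise and os.pread are only available on Unix-like
platforms.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size (8k)


def make_demo_file(block_count):
    """Create a temp file whose block i is filled with the byte value i % 256."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(block_count):
            f.write(bytes([i % 256]) * BLOCK_SIZE)
        return f.name


def read_with_advice(fd, first_block, count):
    """Advise the OS about the upcoming range, then still read 8k at a time."""
    offset = first_block * BLOCK_SIZE
    if hasattr(os, "posix_fadvise"):  # missing on some platforms
        os.posix_fadvise(fd, offset, count * BLOCK_SIZE,
                         os.POSIX_FADV_WILLNEED)
    # The read pattern itself is unchanged: one request per block.
    return [os.pread(fd, BLOCK_SIZE, offset + i * BLOCK_SIZE)
            for i in range(count)]


def read_in_one_chunk(fd, first_block, count):
    """Issue a single large read and slice it into 8k blocks afterwards."""
    offset = first_block * BLOCK_SIZE
    # A short read is possible in general; this sketch does not retry.
    data = os.pread(fd, count * BLOCK_SIZE, offset)
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
```

Both functions return the same list of blocks; what differs is the number
and size of the read requests the OS sees.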

Implementing larger chunked reads or similar asynchronous batch I/O is a
big project, because you'd have to rearchitect the whole way buffers are
managed in the database to do it right.  Greg Stark's earliest
proof-of-concept prototype for async I/O included a Solaris implementation
that used the AIO library.  It wasn't feasible to actually use that
underlying implementation in the database in the end though, because the
model AIO uses expects you'll fire off a bunch of I/O and then retrieve
blocks as they come in.  That's really not easy to align with the model
for how blocks are read into shared_buffers right now.  He had some ideas
for that and I've thought briefly about the problem, but it would be a
major overhaul to some scary-to-touch database internals to pull off.
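
The "fire off a bunch of I/O, then retrieve blocks as they come in" model
can be imitated in Python.  Note that this sketch swaps in a thread pool
for a real AIO library, purely to show the shape of the calling code; the
names read_block and read_blocks_batched are hypothetical, and os.pread is
Unix-only.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

BLOCK_SIZE = 8192  # PostgreSQL's default block size (8k)


def read_block(fd, block):
    """Blocking read of one 8k block: the one-at-a-time model."""
    return os.pread(fd, BLOCK_SIZE, block * BLOCK_SIZE)


def read_blocks_batched(fd, blocks, workers=4):
    """Submit every read up front, then yield (block, data) on completion.

    Results arrive in completion order, which need not match the order in
    which the blocks were requested.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(read_block, fd, b): b for b in blocks}
        for future in as_completed(pending):
            yield pending[future], future.result()
```

The caller has to cope with blocks showing up in any order, which is the
part that is hard to fit onto code that asks for one block and waits for
exactly that block.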

Given that the OS and/or RAID implementations tend to do what we want in a
lot of these cases, where smarter/chunkier read-ahead is what you need,
the payback on accelerating those cases hasn't been perceived as that
great.  There is a major win for the hash index reads, which Solaris
systems can't take advantage of, so somebody who uses those heavily on
that OS might be motivated enough to produce improvements for that use
case.  Once the buffer cache at large understood how to handle batching
async reads, Solaris AIO would be possible, fancier stuff with Linux AIO
would be possible, and the type of chunking reads you're suggesting would
be too.  But none of that is happening without some major rearchitecting
first.

Unfortunately there aren't that many people with the right knowledge and
motivation to start tinkering around with the buffer cache internals to
the extent that would be required to do better here, and pretty much all
of them I'm aware of are hacking on projects with a much clearer payback
instead.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)