posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
well as syncing a bunch of data in different files with a single call
(so that the OS can determine the best write order). I can also imagine
some interaction with the FS journalling system (to avoid duplicate
efforts).
There is also the fact that syncing after every transaction could be
changed to syncing every N transactions (N fixed or depending on the data
size written by the transactions) which would be more efficient than the
current behaviour with a sleep. HOWEVER suppressing the sleep() would lead
to postgres returning from the COMMIT while it is in fact not synced,
which somehow rings a huge alarm bell somewhere.
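To make the "sync every N transactions" idea concrete, here is a rough sketch in C. It is not how postgres does it; struct commit_batch, batch_open(), batch_commit() and batch_flush() are names I made up for illustration. The point is the return value: a commit is only reported as durable once the fsync() covering it has actually happened, so the caller must not answer COMMIT before that.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical log handle for this sketch, not a postgres structure. */
struct commit_batch {
    int fd;          /* log file descriptor */
    int pending;     /* commits written but not yet fsync()ed */
    int sync_every;  /* N: fsync once this many commits are pending */
};

/* Open (or create) the log file in append mode. Returns 0, or -1 on error. */
int batch_open(struct commit_batch *b, const char *path, int sync_every)
{
    b->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (b->fd < 0)
        return -1;
    b->pending = 0;
    b->sync_every = sync_every;
    return 0;
}

/* Force all pending commits to disk. Returns 0, or -1 on error. */
int batch_flush(struct commit_batch *b)
{
    if (b->pending == 0)
        return 0;
    if (fsync(b->fd) != 0)
        return -1;
    b->pending = 0;
    return 0;
}

/*
 * Append one commit record. Returns 1 if this commit (and every earlier
 * one) is now on disk, 0 if it is written but still waiting for a sync,
 * and -1 on error. Only a return of 1 makes it safe to acknowledge COMMIT.
 */
int batch_commit(struct commit_batch *b, const void *rec, size_t len)
{
    const char *p = rec;

    while (len > 0) {
        ssize_t n = write(b->fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t) n;
    }
    b->pending++;
    if (b->pending >= b->sync_every)
        return batch_flush(b) == 0 ? 1 : -1;
    return 0;
}
```

A real version would also need a timeout that calls batch_flush(), otherwise the last few commits on an idle server would wait forever.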
What about read order?
This could be very useful for SELECT queries involving indexes, which in
case of a non-clustered table lead to random seeks in the table.
There's fadvise to tell the OS to read ahead on a seq scan (I think the OS
detects it anyway), but what if there was a system call telling the OS "in the
next seconds I'm going to read these chunks of data from this file (here is
a list of offsets and lengths), could you put them in your cache in the
most efficient order without seeking too much, so that when I read() them
in random order, they will be in the cache already?" This would be an
asynchronous call which would return immediately, just queuing up the data
somewhere in the kernel, and maybe sending a signal to the application
when a certain percentage of the data has been cached.
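As far as I know, the closest thing that exists today is posix_fadvise() with POSIX_FADV_WILLNEED, called once per chunk. A rough sketch in C follows; struct read_range and prefetch_ranges() are names I made up for illustration. Note that this only hints the kernel: it promises nothing about the order in which the reads are issued, and there is no notification when the data is cached, so it covers only part of what I describe above.

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical type for this sketch: one chunk the caller plans to read. */
struct read_range {
    off_t offset;
    off_t length;
};

/*
 * Tell the kernel that each range will be read soon. Returns 0 on
 * success, or the error number from the first failing call
 * (posix_fadvise() returns the error directly instead of setting errno).
 */
int prefetch_ranges(int fd, const struct read_range *ranges, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        int err = posix_fadvise(fd, ranges[i].offset, ranges[i].length,
                                POSIX_FADV_WILLNEED);
        if (err != 0)
            return err;
    }
    return 0;
}
```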
PG could take advantage of this without many code changes, simply by
putting a FIFO between the index scan and the tuple fetches, delaying the
fetches long enough for the OS to have enough queued reads to cluster them efficiently.
On very large tables this would maybe not gain much, but on tables which
are explicitly clustered, or naturally clustered (like accessing an index
on a serial primary key in order), it could be interesting.
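The FIFO itself could be as simple as a small ring buffer of block offsets: the index scan pushes an offset (and hints the kernel at the same time), and the tuple fetch pops the oldest one, so each read happens some entries after its hint was given. A rough sketch in C, with everything here (struct fetch_fifo, fifo_push(), fifo_pop(), the depth of 32) made up for illustration and not taken from the PG source; 8192 is the default postgres block size, and posix_fadvise() stands in for the system call I wish existed.

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

#define FIFO_DEPTH    32    /* arbitrary depth chosen for this sketch */
#define PG_BLOCK_SIZE 8192  /* default postgres block size */

/* Hypothetical queue sitting between the index scan and the tuple fetch. */
struct fetch_fifo {
    int    fd;                 /* table file */
    off_t  slots[FIFO_DEPTH];  /* queued block offsets */
    size_t head;               /* index of the oldest entry */
    size_t count;              /* number of queued entries */
};

/*
 * Index scan side: queue a block offset and hint the kernel that it
 * will be read soon. Returns 0 on success, -1 if the FIFO is full.
 */
int fifo_push(struct fetch_fifo *f, off_t offset)
{
    if (f->count == FIFO_DEPTH)
        return -1;
    f->slots[(f->head + f->count) % FIFO_DEPTH] = offset;
    f->count++;
    /* The hint is advisory only, so a failure here is ignored. */
    (void) posix_fadvise(f->fd, offset, PG_BLOCK_SIZE, POSIX_FADV_WILLNEED);
    return 0;
}

/*
 * Tuple fetch side: take the oldest queued offset, preserving index
 * order. Returns 0 and stores it in *offset, or -1 if the FIFO is empty.
 */
int fifo_pop(struct fetch_fifo *f, off_t *offset)
{
    if (f->count == 0)
        return -1;
    *offset = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 0;
}
```

The fetch side would only start popping once the FIFO is full (or the index scan is done), which is what gives the OS its window to reorder the reads.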
Just a thought.