Re: Anything to be gained from a 'Postgres Filesystem'?

Pierre-Frédéric Caillaud <lists@xxxxxxxxxxxxxxxxxxxxx> · Thu, 04 Nov 2004 13:29:19 +0100

posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
well as syncing a bunch of data in different files with a single call
(so that the OS can determine the best write order). I can also imagine
some interaction with the FS journalling system (to avoid duplicate
efforts).

	There is also the fact that syncing after every transaction could be
changed to syncing every N transactions (N fixed, or depending on the amount
of data written by the transactions), which would be more efficient than the
current behaviour with a sleep. HOWEVER, suppressing the sleep() would lead
to Postgres returning from the COMMIT while the data is in fact not synced,
which rings a huge alarm bell.
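
	To make the alarm bell concrete, here is a rough sketch (not Postgres code)
of syncing every N transactions. GroupCommitLog, batch_size and the record
format are names made up for illustration. The point is that commit()
reports a transaction as durable only once the fsync() covering it has
returned, so nobody is told "committed" early.

```python
import os
import tempfile


class GroupCommitLog:
    """Append-only log that fsyncs once per batch of N commits (hypothetical)."""

    def __init__(self, path, batch_size):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = []    # transaction ids written but not yet synced
        self.sync_count = 0  # number of fsync() calls issued so far

    def commit(self, xid, payload):
        """Write one commit record; return the xids that are now durable."""
        # Sketch only: a real log would loop until the whole payload is written.
        os.write(self.fd, payload)
        self.pending.append(xid)
        if len(self.pending) < self.batch_size:
            return []  # not durable yet, so the caller must keep waiting
        return self.flush()

    def flush(self):
        """Force the batch to disk and release every waiting transaction."""
        if not self.pending:
            return []
        os.fsync(self.fd)
        self.sync_count += 1
        durable, self.pending = self.pending, []
        return durable

    def close(self):
        self.flush()
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = GroupCommitLog(os.path.join(d, "wal"), batch_size=3)
        for xid in (1, 2, 3, 4):
            print(xid, "->", log.commit(xid, b"commit record\n"))
        print("final flush ->", log.flush())
        log.close()
```

	The price is latency: a transaction that arrives alone waits until the
batch fills or someone calls flush(), so a real implementation would
presumably also need a timeout.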

	What about read order?
	This could be very useful for SELECT queries involving indexes, which, in
the case of a non-clustered table, lead to random seeks in the table.
	There's fadvise to tell the OS to read ahead on a seq scan (I think the OS
detects it anyway), but what if there was a system call telling the OS "in the
next few seconds I'm going to read these chunks of data from this file (here
is a list of offsets and lengths), could you put them in your cache in the
most efficient order, without seeking too much, so that when I read() them
in random order they will already be in the cache?" This would be an
asynchronous call which would return immediately, just queuing up the requests
somewhere in the kernel, and maybe sending a signal to the application
when a certain percentage of the data has been cached.
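
	The closest existing thing I know of is posix_fadvise() with
POSIX_FADV_WILLNEED, issued once per chunk. It is only a hint, takes one
range per call, and gives no completion signal, so it is not the call
described above. A minimal sketch, where prefetch_ranges is a made-up
helper and the merging of adjacent ranges is my own assumption about what
would help:

```python
import os


def prefetch_ranges(fd, ranges):
    """Hint that (offset, length) ranges of fd will be read soon.

    Returns the ranges in the order the hints were issued: ascending
    offset, with overlapping or adjacent ranges merged into one.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            end = max(last_off + last_len, offset + length)
            merged[-1] = (last_off, end - last_off)
        else:
            merged.append((offset, length))

    # os.posix_fadvise is not available on every platform; skip the hint there.
    if hasattr(os, "posix_fadvise"):
        for offset, length in merged:
            try:
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            except OSError:
                pass  # purely advisory, so a refused hint is not an error
    return merged
```

	Whether the kernel actually reorders the resulting reads is up to its I/O
scheduler; the sketch only makes sure the hints go out in ascending order.
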
	PG could take advantage of this without many code changes, simply by
putting a FIFO between the index scan and the tuple fetches, to wait long
enough for the OS to have enough reads queued to cluster them efficiently.
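
	As a toy model of that FIFO (nothing to do with the real executor):
collect a window of tuple pointers from the index scan, then visit them in
block order. windowed_fetch_order and the window parameter are
hypothetical names.

```python
from itertools import islice


def windowed_fetch_order(tuple_ids, window):
    """Yield (block, item) pointers from an index scan, reordered per window.

    tuple_ids: iterable of (block_number, item_number) in index order.
    window:    how many pointers to queue before fetching (hypothetical knob).
    """
    it = iter(tuple_ids)
    while True:
        batch = list(islice(it, window))
        if not batch:
            return
        # Within one window, visit heap blocks in ascending order.
        yield from sorted(batch)


if __name__ == "__main__":
    index_order = [(9, 1), (2, 3), (7, 1), (1, 1), (8, 2)]
    print(list(windowed_fetch_order(index_order, 3)))
```

	Note that the rows come out in block order within each window, so a query
relying on index order would have to put them back in order afterwards.
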
	On very large tables this might not gain much, but on tables which
are explicitly clustered, or naturally clustered (like accessing an index
on a serial primary key in order), it could be interesting.

	Just a thought.