
Re: ZFS prefetch considered evil?


 



On 08/07/2009, at 8:39 PM, Alban Hertroys wrote:

On Jul 8, 2009, at 2:50 AM, Yaroslav Tykhiy wrote:

Hi All,

I have a mid-size database (~300G) used as an email store and running on a FreeBSD + ZFS combo. Its PG_DATA is on ZFS whilst xlog goes to a different FFS disk. ZFS prefetch was enabled by default and disk time on PG_DATA was near 100% all the time with transfer rates heavily biased to read: ~50-100M/s read vs ~2-5M/s write. A former researcher, I was going to set up disk performance monitoring to collect some history and see if disabling prefetch would have any effect, but today I had to find out the difference the hard way. Sorry, but that's why the numbers I can provide are quite approximate.

Due to a peak in user activity the server just melted down, with mail data queries taking minutes to execute. As the last resort, I rebooted the server with ZFS prefetch disabled -- it couldn't be disabled at run time in FreeBSD. Now IMAP feels much more responsive; transfer rates on PG_DATA are mostly <10M/s read and 1-2M/s write; and disk time stays way below 100% unless a bunch of email is being inserted.
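For anyone wanting to try the same thing: on FreeBSD of that vintage, file-level prefetch is controlled by a loader tunable that can only be set at boot, so the setting goes in loader.conf. A sketch, assuming the ZFS port exposes the usual `vfs.zfs.prefetch_disable` knob (verify the tunable name on your release before relying on it):

```shell
# Append to /boot/loader.conf, then reboot -- the tunable is read-only
# at run time, which is why a reboot was needed in the first place:
echo 'vfs.zfs.prefetch_disable=1' >> /boot/loader.conf

# After the reboot, confirm the setting took effect:
sysctl vfs.zfs.prefetch_disable
```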

My conclusion is that although ZFS prefetch is supposed to be adaptive and handle random access more or less gracefully, in reality it leaves plenty of room for improvement, so to speak, and for now PostgreSQL performance can benefit from simply leaving it disabled. The same may apply to other database systems as well.


Are you sure you weren't hitting swap?

A sceptic myself, I genuinely understand your doubt. But this time I was sure because I paid attention to the name of the device involved. Moreover, a thrashing system wouldn't have had such a disparity between disk read and write rates.

IIRC prefetch tries to keep data (disk blocks?) in memory that it fetched recently.

What you described is just a disk cache. A trivial implementation of prefetch would work as follows: an application or other file/disk consumer asks the provider (driver, kernel, whatever) to read, say, 2 disk blocks' worth of data. The provider thinks, "I know you are short-sighted; I bet you are going to ask for more contiguous blocks very soon," so it schedules a disk read for many more contiguous blocks than requested and caches them in RAM. For bulk-data applications such as file serving this trick works like a charm. But other applications do truly random access and never come back for the prefetched blocks; in that case both disk bandwidth and cache space are wasted. An advanced implementation can try to distinguish sequential from random access patterns, but in reality that appears to be a challenging task.
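The effect described above is easy to demonstrate with a toy model. This is a hypothetical sketch, not ZFS code: a cache that, on every miss, fetches the requested block plus a fixed number of contiguous followers. A sequential scan amortizes each disk fetch over many later hits, while random access throws almost every prefetched block away:

```python
import random

class ReadAheadCache:
    """Toy read-ahead cache: on a miss, fetch extra contiguous blocks."""

    def __init__(self, backing, readahead=8):
        self.backing = backing      # dict-like: block number -> data
        self.readahead = readahead  # extra contiguous blocks per miss
        self.cache = {}
        self.disk_reads = 0         # blocks actually pulled from "disk"

    def read(self, block):
        if block not in self.cache:
            # Miss: fetch the requested block plus `readahead` followers,
            # betting the consumer will soon ask for them too.
            for b in range(block, block + self.readahead + 1):
                if b in self.backing:
                    self.cache[b] = self.backing[b]
                    self.disk_reads += 1
        return self.cache[block]

disk = {n: f"block-{n}" for n in range(1000)}

# Sequential scan: only every 9th access misses, so prefetch pays off.
seq = ReadAheadCache(disk)
for n in range(100):
    seq.read(n)

# Random access: nearly every access misses, and the 8 extra blocks
# fetched each time are almost never used.
random.seed(1)
rnd = ReadAheadCache(disk)
for _ in range(100):
    rnd.read(random.randrange(1000))

print("sequential disk reads:", seq.disk_reads)
print("random disk reads:    ", rnd.disk_reads)
```

Serving the same 100 requests, the random workload ends up reading several times as many blocks from disk as the sequential one, which is exactly the wasted bandwidth and cache space described above.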

ZFS uses quite a bit of memory, so if you distributed all your memory to be used by just postgres and disk cache then you didn't leave enough space for the prefetch data and _something_ will be moved to swap.

I hope you know that FreeBSD is exceptionally good at distributing available memory between its consumers. That said, useless prefetch indeed puts extra pressure on disk cache and results in unnecessary cache evictions, thus making things even worse. It is true that ZFS is memory hungry and so rather sensitive to non-optimal memory use patterns. Useless prefetch wastes memory that could be used to speed up other ZFS operations.

If you're running FreeBSD i386 then ZFS requires some careful tuning due to the limits a 32-bit OS puts on memory. I recall ZFS not being very stable on i386 a while ago for those reasons, which has by now been fixed as far as possible, but it's not ideal (and it likely never will be).

I use FreeBSD/amd64 and I'm generally happy with ZFS on that platform.

You'll probably want to ask about this on the FreeBSD mailing lists as well; they'll know much better than I do ;)

Are you a local FreeBSD expert? ;-) Joking aside, I don't think this topic has much to do with FreeBSD as such; it is mostly about making the advanced technologies of PostgreSQL and ZFS work well together. Even ZFS developers admit that database workloads may call for exceptions to general ZFS practices and rules.

When I set up my next ZFS-based PostgreSQL server, I think I'll play with ZFS's recordsize property and see whether setting it to PAGESIZE makes any difference.
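For the record, PostgreSQL's default page size (BLCKSZ) is 8 kB, so the experiment would look something like the following. The pool and dataset names are made up for illustration; note that recordsize only affects files written after it is set, so it should be applied before loading the data:

```shell
# Hypothetical dataset layout -- match ZFS record size to PostgreSQL's
# 8 kB page so one page read doesn't drag in a whole 128 kB record:
zfs create tank/pgdata
zfs set recordsize=8k tank/pgdata

# Verify the property before initdb / restore:
zfs get recordsize tank/pgdata
```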

Thanks,

Yar

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
