Re: ZFS prefetch considered evil?

Alban Hertroys <dalroi@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Thu, 9 Jul 2009 14:37:38 +0200

On Jul 9, 2009, at 3:53 AM, Yaroslav Tykhiy wrote:

On 08/07/2009, at 8:39 PM, Alban Hertroys wrote:

On Jul 8, 2009, at 2:50 AM, Yaroslav Tykhiy wrote:
IIRC prefetch tries to keep data (disk blocks?) in memory that it  
fetched recently.

What you described is just a disk cache.  And a trivial  
implementation of prefetch would work as follows:  An application or  
other file/disk consumer asks the provider (driver, kernel,  
whatever) to read, say, 2 disk blocks worth of data.  The provider  
thinks, "I know you are short-sighted; I bet you are going to ask  
for more contiguous blocks very soon," so it schedules a disk read  
for many more contiguous blocks than requested and caches them in  
RAM.  For bulk data applications such as file serving this trick  
works like a charm.  But other applications do truly random access and  
never come back for the prefetched blocks; in this case both  
disk bandwidth and cache space are wasted.  An advanced  
implementation can try to distinguish sequential and random access  
patterns, but in reality it appears to be a challenging task.
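
To make the trade-off above concrete, here is a minimal toy model of such a trivial prefetch. It is only a sketch under stated assumptions, not how ZFS implements it: the class name, the fixed read-ahead of 8 blocks, and the unbounded cache are all made up for illustration.

```python
import random


class ReadAheadCache:
    """Toy block cache with fixed-size read-ahead (hypothetical, not ZFS)."""

    def __init__(self, readahead):
        self.readahead = readahead   # extra contiguous blocks fetched on a miss
        self.cached = set()          # block numbers currently held in RAM
        self.used = set()            # blocks a consumer actually asked for
        self.disk_blocks_read = 0    # total blocks transferred from disk
        self.hits = 0                # requests served from the cache

    def read(self, block):
        if block in self.cached:
            self.hits += 1
        else:
            # Miss: fetch the requested block plus `readahead` blocks after it.
            for b in range(block, block + 1 + self.readahead):
                if b not in self.cached:
                    self.cached.add(b)
                    self.disk_blocks_read += 1
        self.used.add(block)

    def wasted(self):
        """Blocks that were read from disk but never requested."""
        return len(self.cached - self.used)


def run(pattern, readahead=8):
    cache = ReadAheadCache(readahead)
    for block in pattern:
        cache.read(block)
    return cache


# Sequential scan, as in file serving: almost every request is a cache hit.
seq = run(range(1000))

# Random access over a large area, as in many database workloads:
# nearly everything that was prefetched is never touched again.
rng = random.Random(0)
rand = run(rng.randrange(10_000_000) for _ in range(1000))

print("sequential:", seq.hits, "hits,", seq.wasted(), "blocks wasted")
print("random:    ", rand.hits, "hits,", rand.wasted(), "blocks wasted")
```

In the sequential run only the tail of the last read-ahead goes unused, while in the random run roughly eight of every nine blocks read from disk are wasted, which is the bandwidth and cache cost described above.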

Ah yes, thanks for the correction, I now remember reading about that  
before. Makes the name 'prefetch' that much more fitting, doesn't it?

And as you say, it's not that useful a feature with random access  
(hadn't thought about that); in fact, I can imagine that it might  
delay moving the disk-heads to the next desired (random) position as  
the FS is still requesting data that it isn't going to be needing  
(except for some lucky cases) - unless it manages to detect the  
randomness of the access patterns. You can't predict randomness from  
read requests alone, of course, since you don't know which requests  
are still to come. You can, however, assume access is random if  
historic requests turned out to be random by nature, but then  
you'd want to know for which area of the FS this is the case.
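
The idea of judging from history, per area of the file system, could be sketched roughly as follows. This is purely an illustration: the class, the region size, the window length, and the threshold are assumptions of mine and do not describe what ZFS or any other file system actually does.

```python
from collections import defaultdict, deque


class RegionPrefetchAdvisor:
    """Per-region guess at whether reads are sequential (hypothetical sketch)."""

    def __init__(self, region_size=1024, window=16, threshold=0.75):
        self.region_size = region_size   # blocks per tracked area (assumed)
        self.threshold = threshold       # fraction of sequential reads needed
        self.last_block = {}             # region -> last block read there
        # region -> recent booleans: did a read directly follow the previous one?
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, block):
        """Record a read request that has just been served."""
        region = block // self.region_size
        prev = self.last_block.get(region)
        if prev is not None:
            self.history[region].append(block == prev + 1)
        self.last_block[region] = block

    def should_prefetch(self, block):
        """Prefetch only where recent history in that region looked sequential."""
        seen = self.history.get(block // self.region_size)
        if not seen:
            return False   # no history for this region: do not speculate
        return sum(seen) / len(seen) >= self.threshold


advisor = RegionPrefetchAdvisor()

# Region 0 is scanned sequentially.
for b in range(100):
    advisor.observe(b)

# Region 5 is hit at scattered offsets.
for offset in (7, 900, 13, 512, 300, 44):
    advisor.observe(5 * 1024 + offset)

print(advisor.should_prefetch(100))           # region 0
print(advisor.should_prefetch(5 * 1024 + 1))  # region 5
```

Such a heuristic can only react to past requests, so it will still guess wrong whenever the access pattern in a region changes.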

I don't know how you partitioned your zpools, but to me it seems like  
it'd be preferable to have the PostgreSQL tablespaces (and possibly  
other data that's likely to be accessed randomly) in a separate zpool  
from the rest of the system so you can restrict disabling prefetch to  
just that file-system. You probably already did that...

It could be interesting to see how clustering the relevant tables  
would affect the prefetch performance, I'd expect disk access to be  
less random that way. It's probably still better to disable prefetch  
though.

ZFS uses quite a bit of memory, so if you distributed all your  
memory to be used by just postgres and disk cache then you didn't  
leave enough space for the prefetch data and _something_ will be  
moved to swap.

I hope you know that FreeBSD is exceptionally good at distributing  
available memory between its consumers.  That said, useless prefetch  
indeed puts extra pressure on disk cache and results in unnecessary  
cache evictions, thus making things even worse.  It is true that ZFS  
is memory hungry and so rather sensitive to non-optimal memory use  
patterns.  Useless prefetch wastes memory that could be used to  
speed up other ZFS operations.

Yes, I do know that, it's one of the reasons I prefer it over other  
OSs. The keyword here was 'available memory' though, under the  
assumption that something was hitting swap. But apparently that wasn't  
the case.

You'll probably want to ask about this on the FreeBSD mailing lists  
as well, they'll know much better than I do ;)

Are you a local FreeBSD expert? ;-)  Jokes apart, I don't think this  
topic has to do with FreeBSD as such; it is mostly about making the  
advanced technologies of PostgreSQL and ZFS go well together.  Even  
ZFS developers admit that in database-related applications  
exceptions from general ZFS practices and rules may be called for.

I wouldn't call myself an expert, I just use it on a few systems at  
home and am more a user than an administrator. I do read the stable/current  
mailing lists though (since 2004 according to my mail client)  
and keep an eye on (among others) the ZFS discussions as I feel  
tempted to change my gmirrors into zpools some day. It certainly looks  
like an interesting FS, very flexible and reliable.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general