On 08/07/2009, at 8:39 PM, Alban Hertroys wrote:
On Jul 8, 2009, at 2:50 AM, Yaroslav Tykhiy wrote:
Hi All,
I have a mid-size database (~300G) used as an email store and
running on a FreeBSD + ZFS combo. Its PG_DATA is on ZFS whilst
xlog goes to a different FFS disk. ZFS prefetch was enabled by
default and disk time on PG_DATA was near 100% all the time, with
transfer rates heavily biased towards reads: ~50-100 MB/s read vs ~2-5 MB/s
write. A former researcher, I was going to set up disk performance
monitoring to collect some history and see if disabling prefetch
would have any effect, but today I had to find out the difference
the hard way. Sorry, but that's why the numbers I can provide are
quite approximate.
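For anyone who wants to collect that kind of history, the obvious
starting point on FreeBSD would be something along these lines
("tank" is just a placeholder for the pool name):

    # per-pool read/write bandwidth, sampled every 5 seconds
    zpool iostat -v tank 5

    # per-provider busy time and transfer rates; the %busy column
    # is what I loosely call "disk time" above
    gstat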
Due to a peak in user activity the server just melted down, with
mail data queries taking minutes to execute. As a last resort, I
rebooted the server with ZFS prefetch disabled -- it couldn't be
disabled at run time in FreeBSD. Now IMAP feels much more
responsive; transfer rates on PG_DATA are mostly <10 MB/s read and
1-2 MB/s write; and disk time stays well below 100% unless a bunch of
email is being inserted.
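For the archives: on FreeBSD the knob is a boot-time loader tunable,
so disabling prefetch means adding a line like this to
/boot/loader.conf and rebooting:

    vfs.zfs.prefetch_disable="1"

The current value can still be read at run time with
`sysctl vfs.zfs.prefetch_disable'; it just can't be changed there.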
My conclusion is that although ZFS prefetch is supposed to be
adaptive and to handle random access more or less OK, in reality there
is plenty of room for improvement, and for now PostgreSQL performance
can benefit from simply leaving it disabled. The same may apply to
other database systems as well.
Are you sure you weren't hitting swap?
A sceptic myself, I genuinely understand your doubt. But this time I
was sure because I paid attention to the name of the device involved.
Moreover, a thrashing system wouldn't have had such a disparity
between disk read and write rates.
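For completeness, swap is easy enough to double-check on FreeBSD:

    # swap devices and how much of each is in use
    swapinfo
    # the pi/po columns show ongoing page-in/page-out activity
    vmstat 5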
IIRC prefetch tries to keep data (disk blocks?) in memory that it
fetched recently.
What you described is just a disk cache. And a trivial implementation
of prefetch would work as follows: An application or other file/disk
consumer asks the provider (driver, kernel, whatever) to read, say, 2
disk blocks worth of data. The provider thinks, "I know you are short-
sighted; I bet you are going to ask for more contiguous blocks very
soon," so it schedules a disk read for many more contiguous blocks
than requested and caches them in RAM. For bulk data applications
such as file serving this trick works like a charm. But other
applications do truly random access and never come back for the
prefetched blocks; in this case both disk bandwidth and cache space
are wasted. An advanced implementation can try to distinguish
sequential and random access patterns, but in reality it appears to be
a challenging task.
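For what it's worth, ZFS exports counters that at least show how much
prefetch activity is going on; on FreeBSD they should appear under the
arcstats sysctl tree, though I'm not sure every version has them:

    sysctl kstat.zfs.misc.arcstats | grep prefetch

I wouldn't read too much into the hit/miss figures without knowing
their exact semantics, but they make for a cheap sanity check.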
ZFS uses quite a bit of memory, so if you distributed all your
memory to be used by just postgres and disk cache then you didn't
leave enough space for the prefetch data and _something_ will be
moved to swap.
I hope you know that FreeBSD is exceptionally good at distributing
available memory between its consumers. That said, useless prefetch
indeed puts extra pressure on disk cache and results in unnecessary
cache evictions, thus making things even worse. It is true that ZFS
is memory-hungry and so rather sensitive to non-optimal memory use
patterns. Useless prefetch wastes memory that could be used to speed
up other ZFS operations.
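A common mitigation, independent of the prefetch question, is to cap
the ARC so ZFS cannot crowd out the database's own buffers; the value
below is purely illustrative and has to be sized for the actual box
alongside shared_buffers:

    # /boot/loader.conf
    vfs.zfs.arc_max="4G"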
If you're running FreeBSD i386 then ZFS requires some careful tuning
due to the limits a 32-bit OS puts on memory. I recall ZFS not being
very stable on i386 a while ago for those reasons, which has by now
been fixed as far as possible, but it's not ideal (and it likely
never will be).
I use FreeBSD/amd64 and I'm generally happy with ZFS on that platform.
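For anyone who does have to stay on i386, the tuning Alban mentions
is, as far as I understand, mostly about the kernel address space; the
customary starting point is something like the following in
/boot/loader.conf (treat the values as examples only; they depend on
RAM and on the FreeBSD version), possibly together with a kernel
rebuilt with a larger KVA_PAGES:

    vm.kmem_size="512M"
    vm.kmem_size_max="512M"
    vfs.zfs.arc_max="160M"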
You'll probably want to ask about this on the FreeBSD mailing lists
as well, they'll know much better than I do ;)
Are you a local FreeBSD expert? ;-) Jokes aside, I don't think this
topic really has to do with FreeBSD as such; it is mostly about making
the advanced technologies of PostgreSQL and ZFS work well together.
Even ZFS developers admit that database workloads may call for
exceptions to general ZFS practices and rules.
When I set up my next ZFS-based PostgreSQL server, I think I'll play
with the recordsize property of ZFS and see whether setting it to the
PostgreSQL page size (8 kB by default) makes any difference.
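Concretely, that would be something like the following (dataset name
made up), noting that recordsize only affects files written after the
property is set, so it would have to be done before loading the data:

    zfs set recordsize=8k tank/pgdata

8k here is simply PostgreSQL's default block size.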
Thanks,
Yar