Thanks Dimitri! That was very educational material! I'm going to think out loud here, so please correct me if you see any errors. The section on tuning for OLTP transactions was interesting, although my OLAP workload will be predominantly bulk I/O over large datasets of mostly-sequential blocks. The NFS+ZFS section talked about the zil_disable control for making zfs ignore commits/fsyncs. Given that Postgres' executor does single-threaded synchronous I/O like the tar example, it seems like it might benefit significantly from setting zil_disable=1, at least in the case of frequently flushed/committed writes. However, zil_disable=1 sounds unsafe for the datafiles' filesystem, and would probably only be acceptible for the xlogs if they're stored on a separate filesystem and you're willing to loose recently committed transactions. This sounds pretty similar to just setting fsync=off in postgresql.conf, which is easier to change later, so I'll skip the zil_disable control. The RAID-Z section was a little surprising. It made RAID-Z sound just like RAID 50, in that you can customize the trade-off between iops versus usable diskspace and fault-tolerance by adjusting the number/size of parity-protected disk groups. The only difference I noticed was that RAID-Z will apparently set the stripe size across vdevs (RAID-5s) to be as close as possible to the filesystem's block size, to maximize the number of disks involved in concurrently fetching each block. Does that sound about right? So now I'm wondering what RAID-Z offers that RAID-50 doesn't. I came up with 2 things: an alleged affinity for full-stripe writes and (under RAID-Z2) the added fault-tolerance of RAID-6's 2nd parity bit (allowing 2 disks to fail per zpool). It wasn't mentioned in this blog, but I've heard that under certain circumstances, RAID-Z will magically decide to mirror a block instead of calculating parity on it. I'm not sure how this would happen, and I don't know the circumstances that would trigger this behavior, but I think the goal (if it really happens) is to avoid the performance penalty of having to read the rest of the stripe required to calculate parity. As far as I know, this is only an issue affecting small writes (e.g. single-row updates in an OLTP workload), but not large writes (compared to the RAID's stripe size). Anyway, when I saw the filesystem's intent log mentioned, I thought maybe the small writes are converted to full-stripe writes by deferring their commit until a full stripe's worth of data had been accumulated. Does that sound plausible? Are there any other noteworthy perks to RAID-Z, rather than RAID-50? If not, I'm inclined to go with your suggestion, Dimitri, and use zfs like RAID-10 to stripe a zpool over a bunch of RAID-1 vdevs. Even though many of our queries do mostly sequential I/O, getting higher seeks/second is more important to us than the sacrificed diskspace. For the record, those blogs also included a link to a very helpful ZFS Best Practices Guide: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide To sum up, so far the short list of tuning suggestions for ZFS includes: - Use a separate zpool and filesystem for xlogs if your apps write often. - Consider setting zil_disable=1 on the xlogs' dedicated filesystem. ZIL is the intent log, and it sounds like disabling it may be like disabling journaling. Previous message threads in the Postgres archives debate whether this is safe for the xlogs, but it didn't seem like a conclusive answer was reached. - Make filesystem block size (zfs record size) match the Postgres block size. - Manually adjust vdev_cache. I think this sets the read-ahead size. It defaults to 64 KB. For OLTP workload, reduce it; for DW/OLAP maybe increase it. - Test various settings for vq_max_pending (until zfs can auto-tune it). See http://blogs.sun.com/erickustarz/entry/vq_max_pending - A zpool of mirrored disks should support more seeks/second than RAID-Z, just like RAID 10 vs. RAID 50. However, no single Postgres backend will see better than a single disk's seek rate, because the executor currently dispatches only 1 logical I/O request at a time. >>> Dimitri <dimitrik.fr@xxxxxxxxx> 03/23/07 2:28 AM >>> On Friday 23 March 2007 03:20, Matt Smiley wrote: > My company is purchasing a Sunfire x4500 to run our most I/O-bound > databases, and I'd like to get some advice on configuration and tuning. > We're currently looking at: - Solaris 10 + zfs + RAID Z > - CentOS 4 + xfs + RAID 10 > - CentOS 4 + ext3 + RAID 10 > but we're open to other suggestions. > Matt, for Solaris + ZFS you may find answers to all your questions here: http://blogs.sun.com/roch/category/ZFS http://blogs.sun.com/realneel/entry/zfs_and_databases Think to measure log (WAL) activity and use separated pool for logs if needed. Also, RAID-Z is more security-oriented rather performance, RAID-10 should be a better choice... Rgds, -Dimitri