Re: Postgresql 9.4 and ZFS?

On 09/30/2015 07:33 PM, Benjamin Smith wrote:
On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
I think this really depends on the workload - if you have a lot of
random writes, CoW filesystems will perform significantly worse than
e.g. EXT4 or XFS, even on SSD.

I'd be curious about the information you have that leads you to this
conclusion. As with many (most?) "rules of thumb", the devil is quite
often in the details.

A lot of testing I've done recently, and also experience with other CoW filesystems (e.g. BTRFS explicitly warns about workloads with a lot of random writes).

We've been running both on ZFS/CentOS 6 with excellent results, and
are considering putting the two together. In particular, the CoW
nature (and subsequent fragmentation/thrashing) of ZFS becomes
largely irrelevant on SSDs; the very act of wear leveling on an SSD
is itself a form of intentional thrashing that doesn't affect
performance since SSDs have no meaningful seek time.

I don't think that's entirely true. Sure, SSD drives handle random I/O
much better than rotational storage, but it's not entirely free and
sequential I/O is still measurably faster.

It's true that the drives do internal wear leveling, but it probably
uses tricks that are impossible to do at the filesystem level (which is
oblivious to internal details of the SSD). CoW also increases the number
of blocks that need to be reclaimed.

In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
faster than ZFS. But of course, if the ZFS features are interesting
to you, maybe it's a reasonable price to pay.

Again, the details would be highly interesting to me. What memory
optimization was done? Status of snapshots? Was the pool RAIDZ or
mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
release was this? Was this on Linux, Free/Open/NetBSD, Solaris, or
something else?

I'm not sure what you mean by "memory optimization", so the answer is probably "no".

FWIW I don't have much experience with ZFS in production; all I have is data from benchmarks I've recently done, exactly with the goal of educating myself on the differences between current filesystems.

The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly recent versions, IMHO.

My goal was to test the file systems under the same conditions, so I used a single device (Intel S3700 SSD). I'm aware that this is not a perfect test, and that ZFS offers interesting options (e.g. moving the ZIL to a separate device). I plan to benchmark some additional configurations with more devices and such.
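
(For completeness, moving the ZIL just means attaching a dedicated log vdev to the pool - a minimal sketch, with hypothetical pool/device names:)

    # put the ZIL on a separate fast device, so synchronous writes
    # don't compete with the main pool (names are just an example):
    zpool add tank log /dev/nvme0n1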


A 2x performance difference is almost inconsequential in my
experience, where growth is exponential. 2x performance change
generally means 1 to 2 years of advancement or deferment against the
progression of hardware; our current, relatively beefy DB servers
are already older than that, and have an anticipated life cycle of at
least another couple of years.

I'm not sure I understand what you suggest here. What I'm saying is that when I do a stress test on the same hardware, I do get ~2x the throughput with EXT4/XFS, compared to ZFS.
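
A stress test of this kind is typically done with pgbench; a roughly equivalent invocation might look like this (database name, scale, and client counts are illustrative, not the exact values used):

    createdb testdb                   # hypothetical database name
    pgbench -i -s 1000 testdb         # initialize, scale 1000 (~15 GB)
    pgbench -c 32 -j 8 -T 600 testdb  # 32 clients, 8 threads, 10 minutes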

// Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
straight out of RAM cache, with only writes hitting disk. SMART
reports a 4/96 read/write ratio.

So your active set fits into RAM? I'd guess all your writes are then WAL + checkpoints, which probably makes them rather sequential.

If that's the case, CoW filesystems may perform quite well - I was mostly referring to workloads with a lot of random writes to the device.

Query load: Constant, heavy writes and heavy use of temp tables in
order to assemble very complex queries. Pretty much the "worst case"
mix of reads and writes, average daily peak of about 200-250
queries/second.

I'm not sure how much random I/O that actually translates to. According to the numbers I've posted to this thread a few hours ago, a tuned ZFS on a single SSD device handles ~2.5k tps (with a dataset ~2x the RAM). But those are OLTP queries - your queries may write much more data. OTOH it really does not matter that much if your active set fits into RAM, because then it's mostly about writing to the ZIL.
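
To give an idea what "tuned" typically involves for PostgreSQL on ZFS - these are the commonly recommended properties, not necessarily the exact ones used here, and the dataset name is hypothetical:

    zfs set recordsize=8k tank/pgdata    # match PostgreSQL's 8kB page size
    zfs set atime=off tank/pgdata        # avoid metadata writes on every read
    zfs set compression=lz4 tank/pgdata  # cheap compression, often a net win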


16-core Xeon servers, 32 HT "cores".

SAS 3 Gbps

CentOS 6 is our O/S of choice.

Currently, we're running Intel 710 SSDs in a software RAID1 without
TRIM enabled, and we're generally happy with the reliability and
performance we see. We're planning to upgrade storage soon (since we're
over 50% utilization) and, in the process, bring in the magic goodness
of ZFS snapshots/clones.

I presume by "software RAID1" you mean "mirrored vdev zpool", correct?
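
In ZFS terms, a "software RAID1" would look something like this (device names are just an example):

    # a pool with a single mirrored vdev - the ZFS equivalent of RAID1:
    zpool create tank mirror /dev/sda /dev/sdb

    # capacity/IOPS can later be grown by striping additional mirrors:
    zpool add tank mirror /dev/sdc /dev/sdd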


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


