Re: Postgresql 9.4 and ZFS?

Benjamin Smith <lists@xxxxxxxxxxxxxxxxxx> · Wed, 30 Sep 2015 14:12:55 -0700

On Wednesday, September 30, 2015 09:58:08 PM Tomas Vondra wrote:
> On 09/30/2015 07:33 PM, Benjamin Smith wrote:
> > On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
> >> I think this really depends on the workload - if you have a lot of
> >> random writes, CoW filesystems will perform significantly worse than
> >> e.g. EXT4 or XFS, even on SSD.
> > 
> > I'd be curious about the information you have that leads you to this
> > conclusion. As with many (most?) "rules of thumb", the devil is
> > quiteoften the details.
> 
> A lot of testing done recently, and also experience with other CoW
> filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
> random writes).
> 
> >>> We've been running both on ZFS/CentOS 6 with excellent results, and
> >>> are considering putting the two together. In particular, the CoW
> >>> nature (and subsequent fragmentation/thrashing) of ZFS becomes
> >>> largely irrelevant on SSDs; the very act of wear leveling on an SSD
> >>> is itself a form of intentional thrashing that doesn't affect
> >>> performance since SSDs have no meaningful seek time.
> >> 
> >> I don't think that's entirely true. Sure, SSD drives handle random I/O
> >> much better than rotational storage, but it's not entirely free and
> >> sequential I/O is still measurably faster.
> >> 
> >> It's true that the drives do internal wear leveling, but it probably
> >> uses tricks that are impossible to do at the filesystem level (which is
> >> oblivious to internal details of the SSD). CoW also increases the amount
> >> of blocks that need to be reclaimed.
> >> 
> >> In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
> >> faster than ZFS. But of course, if the ZFS features are interesting
> >> for you, maybe it's a reasonable price.
> > 
> > Again, the details would be highly interesting to me. What memory
> > optimization was done? Status of snapshots? Was the pool RAIDZ or
> > mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
> > release was this? Was this on Linux,Free/Open/Net BSD, Solaris, or
> > something else?
> 
> I'm not sure what you mean by "memory optimization" so the answer is
> probably "no".

I mean the full gamut: 

Did you use an l2arc? Did you use a dedicated ZIL? What was arc_max set to? 
How much RAM/GB was installed on the machine? How did you set up PG? (PG 
defaults are historically horrible for higher-RAM machines) 

> FWIW I don't have much experience with ZFS in production, all I have is
> data from benchmarks I've recently done exactly with the goal to educate
> myself on the differences of current filesystems.
> 
> The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly
> recent versions, IMHO.
> 
> My goal was to test the file systems under the same conditions and used
> a single device (Intel S3700 SSD). I'm aware that this is not a perfect
> test and ZFS offers interesting options (e.g. moving ZIL to a separate
> device). I plan to benchmark some additional configurations with more
> devices and such.

Also, did you try with/without compression? My information so far is that 
compression significantly improves overall performance. 

> > A 2x performance difference is almost inconsequential in my
> > experience, where growth is exponential. 2x performance change
> > generally means 1 to 2 years of advancement or deferment against the
> > progression of hardware; our current, relatively beefy DB servers
> > are already older than that, and have an anticipated life cycle of at
> > leastanother couple years.
> 
> I'm not sure I understand what you suggest here. What I'm saying is that
> when I do a stress test on the same hardware, I do get ~2x the
> throughput with EXT4/XFS, compared to ZFS.

What I'm saying is only what it says on its face: A 50% performance difference 
is rarely enough to make or break a production system; performance/capacity 
reserves of 95% or more are fairly typical, which means the difference between 
5% utilization and 10%. Even if latency rose by 50%, that's typically the 
difference between 20ms and 30ms, not enough that, over the 'net for a 
SOAP/REST call, that anybody'd notice even if it's enough to make you want to 
optimize things a bit. 

> > // Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
> > with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
> > straight out of RAM cache, with only writes hitting disk. Smart
> > reports 4/96 read/write ratio.
> 
> So your active set fits into RAM? I'd guess all your writes are then WAL
> + checkpoints, which probably makes them rather sequential.
> 
> If that's the case, CoW filesystems may perform quite well - I was
> mostly referring to workloads with a lot of random writes to he device.

That's *MY* hope, anyway! :) 

> > Query load: Constant, heavy writes and heavy use of temp tables in
> > order to assemble very complex queries. Pretty much the "worst case"
> > mix of reads and writes, average daily peak of about 200-250
> > 
>  > queries/second.
> 
> I'm not sure how much random I/O that actually translates to. According
> to the numbers I've posted to this thread few hours ago, a tuned ZFS on
> a single SSD device handles ~2.5k tps (with dataset ~2x the RAM). But
> those are OLTP queries - your queries may write much more data. OTOH it
> really does not matter that much if your active set fits into RAM,
> because then it's mostly about writing to ZIL.

I personally don't yet know how much sense an SSD-backed ZIL makes when the 
storage media is also SSD-based. 

> > 16 Core XEON servers, 32 HT "cores".
> > 
> > SAS 3 Gbps
> > 
> > CentOS 6 is our O/S of choice.
> > 
> > Currently, we're running Intel 710 SSDs in a software RAID1 without
> > trim enabled and generally happy with the reliability and performance
> > we see. We're planning to upgrade storage soon (since we're over 50%
> > utilization) and in the process, bring the magic goodness of
> > snapshots/clones from ZFS.
> 
> I presume by "software RAID1" you mean "mirrored vdev zpool", correct?

I mean "software RAID 1" with Linux/mdadm. We haven't put ZFS into production 
use on any of our DB servers, yet. 

Thanks for your input. 

Ben

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general