On Wednesday, September 30, 2015 09:58:08 PM Tomas Vondra wrote:
> On 09/30/2015 07:33 PM, Benjamin Smith wrote:
> > On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
> >> I think this really depends on the workload - if you have a lot of
> >> random writes, CoW filesystems will perform significantly worse than
> >> e.g. EXT4 or XFS, even on SSD.
> >
> > I'd be curious about the information you have that leads you to this
> > conclusion. As with many (most?) "rules of thumb", the devil is
> > quite often in the details.
>
> A lot of testing done recently, and also experience with other CoW
> filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
> random writes).
>
> >>> We've been running both on ZFS/CentOS 6 with excellent results, and
> >>> are considering putting the two together. In particular, the CoW
> >>> nature (and subsequent fragmentation/thrashing) of ZFS becomes
> >>> largely irrelevant on SSDs; the very act of wear leveling on an SSD
> >>> is itself a form of intentional thrashing that doesn't affect
> >>> performance since SSDs have no meaningful seek time.
> >>
> >> I don't think that's entirely true. Sure, SSD drives handle random I/O
> >> much better than rotational storage, but it's not entirely free and
> >> sequential I/O is still measurably faster.
> >>
> >> It's true that the drives do internal wear leveling, but it probably
> >> uses tricks that are impossible to do at the filesystem level (which is
> >> oblivious to internal details of the SSD). CoW also increases the amount
> >> of blocks that need to be reclaimed.
> >>
> >> In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
> >> faster than ZFS. But of course, if the ZFS features are interesting
> >> for you, maybe it's a reasonable price.
> >
> > Again, the details would be highly interesting to me. What memory
> > optimization was done? Status of snapshots? Was the pool RAIDZ or
> > mirrored vdevs? How many vdevs? Was compression enabled?
> > What ZFS release was this? Was this on Linux, Free/Open/Net BSD,
> > Solaris, or something else?
>
> I'm not sure what you mean by "memory optimization" so the answer is
> probably "no".

I mean the full gamut: Did you use an l2arc? Did you use a dedicated ZIL?
What was arc_max set to? How much RAM (in GB) was installed on the machine?
How did you set up PG? (PG defaults are historically horrible for
higher-RAM machines.)

> FWIW I don't have much experience with ZFS in production, all I have is
> data from benchmarks I've recently done exactly with the goal to educate
> myself on the differences of current filesystems.
>
> The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly
> recent versions, IMHO.
>
> My goal was to test the file systems under the same conditions and used
> a single device (Intel S3700 SSD). I'm aware that this is not a perfect
> test and ZFS offers interesting options (e.g. moving ZIL to a separate
> device). I plan to benchmark some additional configurations with more
> devices and such.

Also, did you try with/without compression? My information so far is that
compression significantly improves overall performance.

> > A 2x performance difference is almost inconsequential in my
> > experience, where growth is exponential. 2x performance change
> > generally means 1 to 2 years of advancement or deferment against the
> > progression of hardware; our current, relatively beefy DB servers
> > are already older than that, and have an anticipated life cycle of at
> > least another couple years.
>
> I'm not sure I understand what you suggest here. What I'm saying is that
> when I do a stress test on the same hardware, I do get ~2x the
> throughput with EXT4/XFS, compared to ZFS.

What I'm saying is only what it says on its face: A 50% performance
difference is rarely enough to make or break a production system;
performance/capacity reserves of 95% or more are fairly typical, which
means the difference between 5% utilization and 10%.
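
As a rough illustration of the headroom arithmetic above, here is a minimal
sketch. It assumes utilization scales linearly with the per-request cost,
which ignores queueing effects near saturation. The function name and the
linear model are assumptions for illustration only; they are not taken from
any benchmark in this thread.

```python
def utilization_after_slowdown(current_utilization: float, slowdown: float) -> float:
    """Estimate utilization for the same workload when each request costs
    `slowdown` times as much (assumed linear model, hypothetical helper)."""
    if not 0.0 <= current_utilization <= 1.0:
        raise ValueError("current_utilization must be between 0 and 1")
    if slowdown <= 0:
        raise ValueError("slowdown must be positive")
    return current_utilization * slowdown


current = 0.05   # 95% reserve, the figure used in the message
slowdown = 2.0   # throughput halves, as in the quoted ~2x benchmark gap
after = utilization_after_slowdown(current, slowdown)
print(f"{current:.0%} -> {after:.0%} utilization, {1 - after:.0%} reserve left")
```

Under that assumption, the machine still has most of its capacity in reserve
after the slowdown.
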
Even if latency rose by 50%, that's typically the difference between 20ms
and 30ms: not enough, over the 'net for a SOAP/REST call, that anybody'd
notice, even if it's enough to make you want to optimize things a bit.

> > // Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
> > with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
> > straight out of RAM cache, with only writes hitting disk. Smart
> > reports 4/96 read/write ratio.
>
> So your active set fits into RAM? I'd guess all your writes are then WAL
> + checkpoints, which probably makes them rather sequential.
>
> If that's the case, CoW filesystems may perform quite well - I was
> mostly referring to workloads with a lot of random writes to the device.

That's *MY* hope, anyway! :)

> > Query load: Constant, heavy writes and heavy use of temp tables in
> > order to assemble very complex queries. Pretty much the "worst case"
> > mix of reads and writes, average daily peak of about 200-250
> > queries/second.
>
> I'm not sure how much random I/O that actually translates to. According
> to the numbers I've posted to this thread a few hours ago, a tuned ZFS on
> a single SSD device handles ~2.5k tps (with dataset ~2x the RAM). But
> those are OLTP queries - your queries may write much more data. OTOH it
> really does not matter that much if your active set fits into RAM,
> because then it's mostly about writing to ZIL.

I personally don't yet know how much sense an SSD-backed ZIL makes when
the storage media is also SSD-based.

> > 16 Core XEON servers, 32 HT "cores".
> >
> > SAS 3 Gbps
> >
> > CentOS 6 is our O/S of choice.
> >
> > Currently, we're running Intel 710 SSDs in a software RAID1 without
> > trim enabled and generally happy with the reliability and performance
> > we see. We're planning to upgrade storage soon (since we're over 50%
> > utilization) and in the process, bring the magic goodness of
> > snapshots/clones from ZFS.
>
> I presume by "software RAID1" you mean "mirrored vdev zpool", correct?

I mean "software RAID 1" with Linux/mdadm. We haven't put ZFS into
production use on any of our DB servers, yet.

Thanks for your input.

Ben

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general