Mel,

> I was given anecdotal information regarding HFS+ performance under OSX as
> being unsuitable for production PG deployments and that pg_test_fsync
> could be used to measure the relative speed versus other operating systems

You're welcome to identify your source of anecdotal evidence: me.  And
it's based on experience and testing, although I'm not allowed to share
the raw results.  Let's just say that I was part of two different
projects where we moved from using OSX on Apple hardware to using Linux
on the *same* hardware ... because of intolerable performance on OSX.
Switching to Linux more than doubled our real write throughput.

> Compare file sync methods using one 8kB write:
> (in wal_sync_method preference order, except fdatasync
> is Linux's default)
>         open_datasync                      8752.240 ops/sec     114 usecs/op
>         fdatasync                          8556.469 ops/sec     117 usecs/op
>         fsync                              8831.080 ops/sec     113 usecs/op
============================================================================
>         fsync_writethrough                  735.362 ops/sec    1360 usecs/op
============================================================================
>         open_sync                          8967.000 ops/sec     112 usecs/op

fsync_writethrough is the *only* relevant stat above.  For all of the
other fsync operations, OSX is "faking it": returning to the calling
code without ever flushing to disk.  This would result in data
corruption in the event of an unexpected system shutdown.  Both OSX and
Windows do this, which is why we *have* fsync_writethrough.

Mind you, I'm a little shocked that performance is still so bad on
SSDs; I'd attributed HFS+'s slow fsync mostly to waiting for a full
disk rotation, but apparently the primary cause is something else.

You'll notice that the speed of fsync_writethrough is about 1/4 of the
comparable speed on Linux.  You can get similar performance on Linux by
putting your WAL on a ramdisk, but I don't recommend that for
production.

But: things get worse.  In addition to the very slow speed on real
fsyncs, HFS+ has very primitive IO scheduling for multiprocessor
workloads; the filesystem was designed for single-core machines (in
1995!) and has no ability to interleave writes from multiple concurrent
processes.  This results in "stuttering" as the IO system tries to
service first one write request, then another, and ends up stalling
both.  If you do, for example, a serious ETL workload with parallelism
on OSX, you'll see that IO throughput describes a sawtooth from full
speed to zero, and sits near zero about half the time.

So not only are fsyncs slower; real throughput for sustained writes on
HFS+ is 50% or less of the hardware maximum in any real multi-user
workload.  In order to test this, you'd need a workload which required
loading and sorting several tables larger than RAM, at least two in
parallel.

In the words of the lead HFS+ developer, Don Brady: "Since we believed
it was only a stop gap solution, we just went from 16 to 32 bits. Had
we known that it would still be in use 15 years later with
multi-terabyte drives, we probably would have done more design
changes!"

HFS+ was written in about 6 months, and has been largely unimproved
since its release in 1998.  Ext2 doesn't perform too well, either; the
difference is that Linux users have alternative filesystems available.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
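
P.S. If anyone wants to see the "faking it" behavior for themselves
outside of PostgreSQL, here's a rough, untested C sketch along the
lines of what pg_test_fsync measures: it rewrites a single 8kB block in
a loop and times plain fsync() against fcntl(F_FULLFSYNC), which (as
far as I know) is the Darwin call that fsync_writethrough relies on.
The filename, iteration count, and output format are arbitrary.

    /*
     * Crude analog of pg_test_fsync's fsync vs. fsync_writethrough rows:
     * rewrite one 8kB block and flush it after every write, timing plain
     * fsync() against fcntl(F_FULLFSYNC).  F_FULLFSYNC is Darwin-specific;
     * elsewhere only the plain fsync() case runs.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192
    #define ITERATIONS 1000

    /* wall-clock time in seconds */
    static double now_sec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* Rewrite the block ITERATIONS times, flushing after each write;
     * use fcntl(F_FULLFSYNC) when full_flush is set, else fsync(). */
    static double bench(int fd, const char *buf, int full_flush)
    {
        double start = now_sec();

        for (int i = 0; i < ITERATIONS; i++)
        {
            if (pwrite(fd, buf, BLOCK_SIZE, 0) != BLOCK_SIZE)
            {
                perror("pwrite");
                exit(1);
            }

            if (full_flush)
            {
    #ifdef F_FULLFSYNC
                if (fcntl(fd, F_FULLFSYNC, 0) < 0)
                {
                    perror("fcntl(F_FULLFSYNC)");
                    exit(1);
                }
    #endif
            }
            else if (fsync(fd) < 0)
            {
                perror("fsync");
                exit(1);
            }
        }

        return ITERATIONS / (now_sec() - start);
    }

    int main(void)
    {
        char buf[BLOCK_SIZE];
        memset(buf, 'x', sizeof(buf));

        int fd = open("fsync_test.dat", O_RDWR | O_CREAT, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        printf("fsync               %10.3f ops/sec\n", bench(fd, buf, 0));
    #ifdef F_FULLFSYNC
        printf("fsync_writethrough  %10.3f ops/sec\n", bench(fd, buf, 1));
    #else
        printf("F_FULLFSYNC not available on this platform\n");
    #endif

        close(fd);
        unlink("fsync_test.dat");
        return 0;
    }

Run it on the same SSD on OSX and on Linux and you should see the same
pattern as the pg_test_fsync output above: the plain fsync() number is
suspiciously fast, and the F_FULLFSYNC number is the one that reflects
data actually reaching stable storage.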
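P.P.S. The stuttering is harder to demonstrate with a toy program --
the real test is the parallel load-and-sort workload described above --
but as a much cruder proxy, here's a sketch that forks two processes
streaming large sequential writes to separate files while each prints
its per-second throughput.  On a filesystem that can't interleave
concurrent writers you'd expect those numbers to swing between
near-full-speed and near-zero instead of staying flat.  File names and
sizes are arbitrary; make the per-writer total comfortably larger than
RAM, or the page cache will smooth everything out.

    /*
     * Two concurrent sequential writers, each reporting throughput roughly
     * once per second.  A crude proxy for the parallel ETL behavior
     * described above, not a substitute for it.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define CHUNK (8 * 1024 * 1024)          /* 8MB per write() */
    #define TOTAL (4LL * 1024 * 1024 * 1024) /* 4GB per writer; adjust to exceed RAM */

    static double now_sec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static void writer(int id)
    {
        char path[64];
        snprintf(path, sizeof(path), "stream_%d.dat", id);

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); exit(1); }

        char *buf = malloc(CHUNK);
        if (buf == NULL) { perror("malloc"); exit(1); }
        memset(buf, 'x', CHUNK);

        long long written = 0, since_report = 0;
        double last = now_sec();

        while (written < TOTAL)
        {
            if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); exit(1); }
            written += CHUNK;
            since_report += CHUNK;

            double t = now_sec();
            if (t - last >= 1.0)            /* report roughly once per second */
            {
                printf("writer %d: %8.1f MB/s\n",
                       id, since_report / (t - last) / (1024 * 1024));
                fflush(stdout);
                since_report = 0;
                last = t;
            }
        }
        fsync(fd);
        close(fd);
        unlink(path);
        exit(0);
    }

    int main(void)
    {
        for (int id = 0; id < 2; id++)      /* two concurrent writers */
            if (fork() == 0)
                writer(id);

        while (wait(NULL) > 0)              /* wait for both children */
            ;
        return 0;
    }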