On Sun, Feb 17, 2019 at 4:56 AM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> writes:
> >>> Really?  The specification says that it starts I/O, not that it waits
> >>> around for any to finish.
>
> > Right, there was some discussion of that, and I didn't (and still
> > don't) think it'd be wise to rely on undocumented knowledge about
> > which flags can eat errors based on a drive-by reading of a particular
> > snapshot of the Linux tree.  The man page says it can return EIO; I
> > think we should assume that it might actually do that.
>
> I had a thought about this: maybe we should restrict the scope of this
> behavior to be "panic on EIO", not "panic on anything within hailing
> distance of fsync".
>
> The direction you and Andres seem to want to go in is to add a pile of
> unprincipled exception cases, which seems like a recipe for constant
> pain to me.  I think we might be better off with a whitelist of errnos
> that mean trouble, instead of a blacklist of some that don't.  I'm
> especially troubled by the idea that blacklisting some errnos might
> reduce to ignoring them completely, which would be a step backwards
> from our pre-PANIC behavior.

Hmm.  Well, at least ENOSPC should be treated the same way as EIO.

Here's an experiment that seems to confirm some speculations about
NFS on Linux from the earlier threads:

$ uname -a
Linux debian 4.18.0-3-amd64 #1 SMP Debian 4.18.20-2 (2018-11-23) x86_64 GNU/Linux
$ dpkg -l nfs-kernel-server | tail -1
ii  nfs-kernel-server  1:1.3.4-2.4  amd64  support for NFS kernel server

First, set up a 10MB loop-back filesystem:

$ dd if=/dev/zero of=/tmp/10mb.loopback bs=1024 count=10000
$ sudo losetup /dev/loop0 /tmp/10mb.loopback
$ sudo mkfs -t ext3 -m 1 -v /dev/loop0
...
$ sudo mkdir /mnt/test_loopback
$ sudo mount -t ext3 /dev/loop0 /mnt/test_loopback

Then, export that via NFS:

$ tail -1 /etc/exports
/mnt/test_loopback localhost(rw,sync,no_subtree_check)
$ sudo exportfs -av
exporting localhost:/mnt/test_loopback

Next, mount that over NFS:

$ sudo mkdir /mnt/test_loopback_remote
$ sudo mount localhost:/mnt/test_loopback /mnt/test_loopback_remote

Now, fill up the whole disk with a file full of newlines:

$ sudo mkdir /mnt/test_loopback/dir
$ sudo chown $USER:$USER /mnt/test_loopback/dir
$ tr "\000" "\n" < /dev/zero > /mnt/test_loopback_remote/dir/file
tr: write error: No space left on device
tr: write error
$ df -h /mnt/test_loopback*
Filesystem                    Size  Used Avail Use% Mounted on
/dev/loop0                    8.5M  8.4M     0 100% /mnt/test_loopback
localhost:/mnt/test_loopback  8.5M  8.4M     0 100% /mnt/test_loopback_remote

Now, run a program that appends a greeting and then calls fsync() twice:

$ cat test.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int fd, rc;

    fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    rc = write(fd, "hello world\n", 12);
    if (rc < 0)
        perror("write");
    else if (rc < 12)
        fprintf(stderr, "only managed to write %d bytes\n", rc);
    rc = fsync(fd);
    if (rc < 0)
        perror("fsync 1");
    rc = fsync(fd);
    if (rc < 0)
        perror("fsync 2");
    rc = close(fd);
    if (rc < 0)
        perror("close");

    return 0;
}
$ cc test.c
$ ./a.out
fsync 1: No space left on device
$

The write() and the second fsync() reported success.  Great, let's go
and look at our precious data, both through NFS and locally:

$ tail -3 /mnt/test_loopback_remote/dir/file
$ tail -3 /mnt/test_loopback/dir/file
$

It's gone.
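So the first fsync() failure is the only report we'll ever get, and a
retry can falsely succeed.  To spell out the conclusion in code, the
only safe reaction I can see is something like this (a sketch of my
own, not PostgreSQL source; fsync_or_die() is a name I made up):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Made-up helper: treat the first fsync() failure as fatal.  As the
 * experiment above shows, the kernel may already have dropped the
 * dirty pages and cleared the error state, so a retry can report
 * success while the data is gone.
 */
static void
fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) < 0)
    {
        fprintf(stderr, "fsync of %s failed: %s\n", path, strerror(errno));
        abort();        /* crash and recover; do not retry */
    }
}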
If you try it again with a file containing just a few newlines so
there is free space, it works correctly and you see the appended
greeting.  Perhaps the same sort of thing might happen with remote
EDQUOT, but I haven't tried that.  Perhaps there are settings that
could be tuned to avoid that?

(Some speculation about NFS: To avoid data loss from running out of
disk space, I think PostgreSQL requires either a filesystem that
reserves space when we're extending a file, so that we can exclude
the possibility of ENOSPC before we evict data from our own shared
buffers, or a page cache that doesn't drop dirty flags or whole
buffers on failure, so we can meaningfully retry once space becomes
available.  As far as I know, the former would be theoretically
possible with NFS, if the client and server are using NFSv4.2+ with
ALLOCATE support and glibc and kernel versions both support true
fallocate() and pass it all the way through.  But current versions
either don't support fallocate() on NFS files at all (this 4.18
kernel doesn't) or sometimes emulate it by writing zeroes, which is
useless for remote space reservation purposes, and (according to some
sources I found) there is currently no reliable way to detect that
emulation through libc.  If that situation improves, we'd still need
to do explicit fallocate() on our side to reserve space, and it'd
probably be slow, so we might have to work in bigger chunks to
amortise the latency.  There's a sketch of what that might look like
at the end of this mail.)

Returning to your question of how to decide whether to have an errno
include-list or an exclude-list for our new panic behaviour, I think
we should tolerate ENOSYS as a special case for sync_file_range()
only, because:

1.  We don't actually need sync_file_range() at all for correct
operation (unlike fsync()).

2.  ENOSYS is the only errno that very explicitly says "this didn't
go anywhere near Linux filesystem code".  Therefore, it definitely
didn't eat any errors relating to jettisoned data.  (OK, perhaps
EBADF and EINVAL tell you something similar, but they also imply a
serious bug in the calling code, having lost track of file
descriptors or something.)

So far I still think that we should panic if fsync() returns any
error number at all.

For sync_file_range(), it sounds like maybe you think we should leave
the warning-spewing in there for ENOSYS, to do exactly what we did
before on principle, since that's what back-branches are all about?
Something like:

    ereport(errno == ENOSYS ? WARNING : data_sync_elevel(WARNING),

Perhaps for master we could skip it completely, or somehow warn just
once, and/or switch to one of our other implementations at runtime?
I don't really have a strong view on that, not being a user of that
system.  Will they ever implement it?  Are there other systems we
care about that don't implement it?  (Android?)
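To make that concrete, here's roughly how I imagine it sitting in the
flush code (a sketch, not a patch, using the ereport() machinery and
data_sync_elevel() mentioned above):

    /*
     * Sketch: ENOSYS means the call never reached filesystem code,
     * so it can't have eaten an error.  Keep the old warning for
     * that case only, and escalate anything else like an fsync()
     * failure.
     */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) < 0)
        ereport(errno == ENOSYS ? WARNING : data_sync_elevel(WARNING),
                (errcode_for_file_access(),
                 errmsg("could not flush dirty data: %m")));

And, going back to the NFS speculation above, explicit space
reservation might look something like this, assuming a kernel and
filesystem where fallocate() is real rather than emulated
(reserve_extension() is a name I just made up):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/*
 * Made-up helper: try to reserve space before we dirty any shared
 * buffers, so that ENOSPC surfaces here rather than in a later
 * fsync() after the page cache may have jettisoned the data.  Note
 * fallocate() rather than posix_fallocate(): glibc emulates the
 * latter by writing zeroes when the filesystem can't do it, which
 * reserves nothing on an NFS server.
 */
static int
reserve_extension(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) < 0)
        return errno;       /* ENOSPC, or EOPNOTSUPP if unsupported */
    return 0;
}

--
Thomas Munro
http://www.enterprisedb.com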