On Sun, Feb 17, 2019 at 4:56 AM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> writes:
> >>> Really?  The specification says that it starts I/O, not that it waits
> >>> around for any to finish.
>
> > Right, there was some discussion of that, and I didn't (and still
> > don't) think it'd be wise to rely on undocumented knowledge about
> > which flags can eat errors based on a drive-by reading of a particular
> > snapshot of the Linux tree.  The man page says it can return EIO; I
> > think we should assume that it might actually do that.
>
> I had a thought about this: maybe we should restrict the scope of this
> behavior to be "panic on EIO", not "panic on anything within hailing
> distance of fsync".
>
> The direction you and Andres seem to want to go in is to add a pile of
> unprincipled exception cases, which seems like a recipe for constant
> pain to me.  I think we might be better off with a whitelist of errnos
> that mean trouble, instead of a blacklist of some that don't.  I'm
> especially troubled by the idea that blacklisting some errnos might
> reduce to ignoring them completely, which would be a step backwards
> from our pre-PANIC behavior.

Hmm.  Well, at least ENOSPC should be treated the same way as EIO.

Here's an experiment that seems to confirm some speculations about
NFS on Linux from the earlier threads:

$ uname -a
Linux debian 4.18.0-3-amd64 #1 SMP Debian 4.18.20-2 (2018-11-23) x86_64 GNU/Linux
$ dpkg -l nfs-kernel-server | tail -1
ii  nfs-kernel-server  1:1.3.4-2.4  amd64  support for NFS kernel server

First, set up a 10MB loop-back filesystem:

$ dd if=/dev/zero of=/tmp/10mb.loopback bs=1024 count=10000
$ sudo losetup /dev/loop0 /tmp/10mb.loopback
$ sudo mkfs -t ext3 -m 1 -v /dev/loop0
...
$ sudo mkdir /mnt/test_loopback
$ sudo mount -t ext3 /dev/loop0 /mnt/test_loopback

Then, export that via NFS:

$ tail -1 /etc/exports
/mnt/test_loopback localhost(rw,sync,no_subtree_check)
$ sudo exportfs -av
exporting localhost:/mnt/test_loopback

Next, mount that over NFS:

$ sudo mkdir /mnt/test_loopback_remote
$ sudo mount localhost:/mnt/test_loopback /mnt/test_loopback_remote

Now, fill up the whole disk with a file full of newlines:

$ sudo mkdir /mnt/test_loopback/dir
$ sudo chown $USER:$USER /mnt/test_loopback/dir
$ tr "\000" "\n" < /dev/zero > /mnt/test_loopback_remote/dir/file
tr: write error: No space left on device
tr: write error
$ df -h /mnt/test_loopback*
Filesystem                    Size  Used Avail Use% Mounted on
/dev/loop0                    8.5M  8.4M     0 100% /mnt/test_loopback
localhost:/mnt/test_loopback  8.5M  8.4M     0 100% /mnt/test_loopback_remote

Now, run a program that appends a greeting and then calls fsync() twice:

$ cat test.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int fd, rc;

    fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    rc = write(fd, "hello world\n", 12);
    if (rc < 0)
        perror("write");
    else if (rc < 12)
        fprintf(stderr, "only managed to write %d bytes\n", rc);
    rc = fsync(fd);
    if (rc < 0)
        perror("fsync 1");
    rc = fsync(fd);
    if (rc < 0)
        perror("fsync 2");
    rc = close(fd);
    if (rc < 0)
        perror("close");

    return 0;
}
$ cc test.c
$ ./a.out
fsync 1: No space left on device
$

The write() and the second fsync() reported success.  Great, let's go
and look at our precious data, both through NFS and locally:

$ tail -3 /mnt/test_loopback_remote/dir/file
$ tail -3 /mnt/test_loopback/dir/file
$

It's gone.
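So the first fsync() failure is the only report we'll ever get, and a
retry can falsely succeed.  To spell out the conclusion in code, the
only safe reaction I can see is something like this (a sketch of my
own, not PostgreSQL source; fsync_or_die() is a name I made up):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Made-up helper: treat the first fsync() failure as fatal.  As the
 * experiment above shows, the kernel may already have dropped the
 * dirty pages and cleared the error state, so a retry can report
 * success while the data is gone.
 */
static void
fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) < 0)
    {
        fprintf(stderr, "fsync of %s failed: %s\n", path, strerror(errno));
        abort();        /* crash and recover; do not retry */
    }
}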
If you try it again with a file containing just a few newlines so
there is free space, it works correctly and you see the appended
greeting.  Perhaps the same sort of thing might happen with remote
EDQUOT, but I haven't tried that.  Perhaps there are settings that
could be tuned to avoid that?

(Some speculation about NFS: To avoid data loss from running out of
disk space, I think PostgreSQL requires either a filesystem that
reserves space when we're extending a file, so that we can exclude
the possibility of ENOSPC before we evict data from our own shared
buffers, or a page cache that doesn't drop dirty flags or whole
buffers on failure, so we can meaningfully retry once space becomes
available.  As far as I know, the former would be theoretically
possible with NFS, if the client and server are using NFSv4.2+ with
ALLOCATE support and glibc and kernel versions both support true
fallocate() and pass it all the way through.  But current versions
either don't support fallocate() on NFS files at all (this 4.18
kernel doesn't) or sometimes emulate it by writing zeroes, which is
useless for remote space reservation purposes, and (according to some
sources I found) there is currently no reliable way to detect that
emulation through libc.  If that situation improves, we'd still need
to do explicit fallocate() on our side to reserve space, and it'd
probably be slow, so we might have to work in bigger chunks to
amortise the latency.  There's a sketch of what that might look like
at the end of this mail.)

Returning to your question of how to decide whether to have an errno
include-list or an exclude-list for our new panic behaviour, I think
we should tolerate ENOSYS as a special case for sync_file_range()
only, because:

1.  We don't actually need sync_file_range() at all for correct
operation (unlike fsync()).

2.  ENOSYS is the only errno that very explicitly says "this didn't
go anywhere near Linux filesystem code".  Therefore, it definitely
didn't eat any errors relating to jettisoned data.  (OK, perhaps
EBADF and EINVAL tell you something similar, but they also imply a
serious bug in the calling code, having lost track of file
descriptors or something.)

So far I still think that we should panic if fsync() returns any
error number at all.

For sync_file_range(), it sounds like maybe you think we should leave
the warning-spewing in there for ENOSYS, to do exactly what we did
before on principle, since that's what back-branches are all about?
Something like:

    ereport(errno == ENOSYS ? WARNING : data_sync_elevel(WARNING),

Perhaps for master we could skip it completely, or somehow warn just
once, and/or switch to one of our other implementations at runtime?
I don't really have a strong view on that, not being a user of that
system.  Will they ever implement it?  Are there other systems we
care about that don't implement it?  (Android?)
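To make that concrete, here's roughly how I imagine it sitting in the
flush code (a sketch, not a patch, using the ereport() machinery and
data_sync_elevel() mentioned above):

    /*
     * Sketch: ENOSYS means the call never reached filesystem code,
     * so it can't have eaten an error.  Keep the old warning for
     * that case only, and escalate anything else like an fsync()
     * failure.
     */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) < 0)
        ereport(errno == ENOSYS ? WARNING : data_sync_elevel(WARNING),
                (errcode_for_file_access(),
                 errmsg("could not flush dirty data: %m")));

And, going back to the NFS speculation above, explicit space
reservation might look something like this, assuming a kernel and
filesystem where fallocate() is real rather than emulated
(reserve_extension() is a name I just made up):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/*
 * Made-up helper: try to reserve space before we dirty any shared
 * buffers, so that ENOSPC surfaces here rather than in a later
 * fsync() after the page cache may have jettisoned the data.  Note
 * fallocate() rather than posix_fallocate(): glibc emulates the
 * latter by writing zeroes when the filesystem can't do it, which
 * reserves nothing on an NFS server.
 */
static int
reserve_extension(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) < 0)
        return errno;       /* ENOSPC, or EOPNOTSUPP if unsupported */
    return 0;
}

--
Thomas Munro
http://www.enterprisedb.com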