On Mon, Aug 13, 2018 at 11:22:10PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > O_APPEND is POSIX and means race-free append. If you mark some call
> > sites with O_APPEND, then that must be the ones that need race-free
> > append. Hence, you would have to go the other route: Mark those call
> > sites that do _not_ need race-free append with some custom
> > function/macro. (Or mark both with different helpers and avoid writing
> > down O_APPEND.)
>
> O_APPEND in POSIX is race-free only up to PIPE_MAX bytes written at a
> time, which is e.g. 2^12 by default on linux, after that all bets are
> off and the kernel is free to interleave different write calls.

This is a claim I've run across often, but I've never seen a good
citation for it. Certainly atomic writes to _pipes_ are governed by
PIPE_BUF (which on Linux is a compile-time constant of 4096; it is the
pipe's total capacity, not the atomic-write limit, that can be changed
at run-time). But is any of that relevant for regular-file writes?

Another gem I found while digging on this O_APPEND/FILE_APPEND_DATA
stuff the other day: somebody claimed that the maximum atomic-append
size is 4096 on Linux and 1024 on Windows. But their experimental
script was written in bash! So I suspect they were really just
measuring the size of stdio buffers.

Here's my attempt at a test setup. This C program forces two processes
to write simultaneously to the same file with O_APPEND:

-- >8 --
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* append "size"-byte blocks of the byte "c" to "fn" forever */
static void doit(int size, const char *fn, char c)
{
	int fd;
	char *buf;

	fd = open(fn, O_WRONLY|O_APPEND|O_CREAT, 0666);
	if (fd < 0) {
		perror("open");
		return;
	}

	buf = malloc(size);
	memset(buf, c, size);

	while (1)
		write(fd, buf, size);
}

int main(int argc, const char **argv)
{
	int size = atoi(argv[1]);

	/* parent writes '1's, child writes '2's, racing with each other */
	if (fork())
		doit(size, argv[2], '1');
	else
		doit(size, argv[2], '2');
	return 0;
}
-- 8< --

and then this program checks that we saw atomic units of the correct
size:

-- >8 --
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
	int size = atoi(argv[1]);
	char *buf;

	buf = malloc(size);
	while (1) {
		int i;

		/* assume atomic reads, i.e., no signals */
		int r = read(0, buf, size);
		if (!r)
			break;

		/* every byte of a block must match its first byte */
		for (i = 1; i < size; i++) {
			if (buf[i] != buf[0]) {
				fprintf(stderr, "overlap\n");
				return 1;
			}
		}
	}
	return 0;
}
-- 8< --

And then, with the two programs compiled as "write" and "check", you
can do something like:

  for size in 4097 8193 16385 32769 65537 131073 262145 524289 1048577; do
	>out ;# clean up from last run
	echo "Trying $size..."
	timeout 5 ./write $size out
	if ! ./check $size <out; then
		echo "$size failed"
		break
	fi
  done

On my Linux system, each of those seems to write several gigabytes
without overlapping. I did manage to hit some failing cases, but they
were never sheared writes; rather, they were cases where there was an
incomplete write at the end of the file (presumably from a writer
being killed by "timeout" mid-write).

So obviously this is all a bit of a tangent. I'd be fine declaring that
trace output is generally small enough not to worry about this in the
first place. But those results show that it shouldn't matter even if
we're writing 1MB trace lines on Linux. I wouldn't be at all surprised
to see different results on other operating systems, though.

-Peff
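
P.S. On the PIPE_BUF aside above, here's a minimal sketch of how to see
the distinction on Linux. It assumes F_GETPIPE_SZ/F_SETPIPE_SZ, which
need _GNU_SOURCE and Linux 2.6.35 or later; it's just an illustration,
not part of the test setup:

-- >8 --
#define _GNU_SOURCE
#include <stdio.h>
#include <limits.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fds[2];

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	/* PIPE_BUF is the POSIX atomic-write limit; a constant 4096 on Linux */
	printf("PIPE_BUF: %d\n", PIPE_BUF);

	/* the pipe's capacity, by contrast, is per-pipe and run-time tunable */
	printf("capacity: %d\n", fcntl(fds[0], F_GETPIPE_SZ));
	if (fcntl(fds[0], F_SETPIPE_SZ, 1 << 20) < 0)
		perror("F_SETPIPE_SZ");
	printf("new capacity: %d\n", fcntl(fds[0], F_GETPIPE_SZ));

	return 0;
}
-- 8< --

On a typical Linux box that should print 4096, 65536, and 1048576;
growing the capacity does nothing to the atomic-write guarantee.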