On Mon, Aug 13, 2018 at 11:22:10PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > O_APPEND is POSIX and means race-free append. If you mark some call
> > sites with O_APPEND, then that must be the ones that need race-free
> > append. Hence, you would have to go the other route: Mark those call
> > sites that do _not_ need race-free append with some custom
> > function/macro. (Or mark both with different helpers and avoid writing
> > down O_APPEND.)
>
> O_APPEND in POSIX is race-free only up to PIPE_MAX bytes written at a
> time, which is e.g. 2^12 by default on linux, after that all bets are
> off and the kernel is free to interleave different write calls.

This is a claim I've run across often, but I've never seen a good
citation for it. Certainly atomic writes to _pipes_ are governed by
PIPE_BUF (which on Linux is a compile-time constant of 4096; it is the
pipe's total capacity, not the atomic-write limit, that can be changed
at run-time). But is any of that relevant for regular-file writes?

Another gem I found while digging on this O_APPEND/FILE_APPEND_DATA
stuff the other day: somebody claimed that the maximum atomic-append
size is 4096 on Linux and 1024 on Windows. But their experimental
script was written in bash! So I suspect they were really just
measuring the size of stdio buffers.

Here's my attempt at a test setup. This C program forces two processes
to write simultaneously to the same file with O_APPEND:

-- >8 --
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* append "size"-byte blocks of the byte "c" to "fn" forever */
static void doit(int size, const char *fn, char c)
{
	int fd;
	char *buf;

	fd = open(fn, O_WRONLY|O_APPEND|O_CREAT, 0666);
	if (fd < 0) {
		perror("open");
		return;
	}

	buf = malloc(size);
	memset(buf, c, size);

	while (1)
		write(fd, buf, size);
}

int main(int argc, const char **argv)
{
	int size = atoi(argv[1]);

	/* parent writes '1's, child writes '2's, racing with each other */
	if (fork())
		doit(size, argv[2], '1');
	else
		doit(size, argv[2], '2');
	return 0;
}
-- 8< --

and then this program checks that we saw atomic units of the correct
size:

-- >8 --
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
	int size = atoi(argv[1]);
	char *buf;

	buf = malloc(size);
	while (1) {
		int i;

		/* assume atomic reads, i.e., no signals */
		int r = read(0, buf, size);
		if (!r)
			break;

		/* every byte of a block must match its first byte */
		for (i = 1; i < size; i++) {
			if (buf[i] != buf[0]) {
				fprintf(stderr, "overlap\n");
				return 1;
			}
		}
	}
	return 0;
}
-- 8< --

And then, with the two programs compiled as "write" and "check", you
can do something like:

  for size in 4097 8193 16385 32769 65537 131073 262145 524289 1048577; do
	>out ;# clean up from last run
	echo "Trying $size..."
	timeout 5 ./write $size out
	if ! ./check $size <out; then
		echo "$size failed"
		break
	fi
  done

On my Linux system, each of those seems to write several gigabytes
without overlapping. I did manage to hit some failing cases, but they
were never sheared writes; rather, they were cases where there was an
incomplete write at the end of the file (presumably from a writer
being killed by "timeout" mid-write).

So obviously this is all a bit of a tangent. I'd be fine declaring that
trace output is generally small enough not to worry about this in the
first place. But those results show that it shouldn't matter even if
we're writing 1MB trace lines on Linux. I wouldn't be at all surprised
to see different results on other operating systems, though.

-Peff
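
P.S. On the PIPE_BUF aside above, here's a minimal sketch of how to see
the distinction on Linux. It assumes F_GETPIPE_SZ/F_SETPIPE_SZ, which
need _GNU_SOURCE and Linux 2.6.35 or later; it's just an illustration,
not part of the test setup:

-- >8 --
#define _GNU_SOURCE
#include <stdio.h>
#include <limits.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fds[2];

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	/* PIPE_BUF is the POSIX atomic-write limit; a constant 4096 on Linux */
	printf("PIPE_BUF: %d\n", PIPE_BUF);

	/* the pipe's capacity, by contrast, is per-pipe and run-time tunable */
	printf("capacity: %d\n", fcntl(fds[0], F_GETPIPE_SZ));
	if (fcntl(fds[0], F_SETPIPE_SZ, 1 << 20) < 0)
		perror("F_SETPIPE_SZ");
	printf("new capacity: %d\n", fcntl(fds[0], F_GETPIPE_SZ));

	return 0;
}
-- 8< --

On a typical Linux box that should print 4096, 65536, and 1048576;
growing the capacity does nothing to the atomic-write guarantee.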