Hi,

You probably already considered this; sorry if so. Instead of a mutex,
Windows uses ExecutiveResource with shared and exclusive semantics.
Readers serialize by taking the resource shared, and writers take it
exclusive. I have that implemented for Linux. Please let me know if
there is any interest!

-----Original message-----
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
To: "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx>
Cc: lkml <linux-kernel@xxxxxxxxxxxxxxx>, Miklos Szeredi <miklos@xxxxxxxxxx>,
    Theodore T'so <tytso@xxxxxxx>, Christoph Hellwig <hch@xxxxxx>,
    Chris Mason <clm@xxxxxx>, Dave Chinner <david@xxxxxxxxxxxxx>,
    Linux-Fsdevel <linux-fsdevel@xxxxxxxxxxxxxxx>,
    Al Viro <viro@xxxxxxxxxxxxxxxxxx>,
    "J. Bruce Fields" <bfields@xxxxxxxxxxxxxx>,
    Yongzhi Pan <panyongzhi@xxxxxxxxx>
Sent: Thu, Feb 20, 2014 17:15:07 GMT+00:00
Subject: Re: Update of file offset on write() etc. is non-atomic with I/O

Yes, I do think we violate POSIX here because of how we handle f_pos
(the earlier thread from 2006 you point to talks about the "thread
safe" part, the point here about the actual wording of "atomic" is a
separate issue).

Long long ago we used to just pass in the pointer to f_pos directly,
and then the low-level write would update it all under the inode
semaphore, and all was good.

And then we ended up having tons of problems with non-regular files
and drivers accessing f_pos and having nasty races with it because
they did *not* have any locking (and very fundamentally didn't want
any), and we broke the serialization of f_pos. We still do the *IO*
atomically, but yes, the f_pos access itself is outside the lock.

Ho humm.. Al, any ideas of how to fix this?

Linus

On Mon, Feb 17, 2014 at 7:41 AM, Michael Kerrisk (man-pages)
<mtk.manpages@xxxxxxxxx> wrote:
> Hello all,
>
> A note from Yongzhi Pan about some of my own code led me to dig deeper
> and discover behavior that is surprising and also seems to be a
> fairly clear violation of POSIX requirements.
>
> It appears that write() (and, presumably read() and other similar
> system calls) are not atomic with respect to performing I/O and
> updating the file offset behavior.
>
> The problem can be demonstrated using the program below.
> That program takes three arguments:
>
> $ ./multi_writer num-children num-blocks block-size > somefile
>
> It creates 'num-children' children, each of which writes 'num-blocks'
> blocks of 'block-size' bytes to standard output; for my experiments,
> stdout is redirected to a file. After all children have finished,
> the parent inspects the size of the file written on stdout, calculates
> the expected size of the file, and displays these two values, and
> their difference on stderr.
>
> Some observations:
>
> * All children inherit the stdout file descriptor from the parent;
>   thus the FDs refer to the same open file description, and therefore
>   share the file offset.
>
> * When I run this on a multi-CPU BSD system, I get the expected result:
>
>   $ ./multi_writer 10 10000 1000 > g 2> run.log
>   $ ls -l g
>   -rw------- 1 mkerrisk users 100000000 Jan 17 07:34 g
>
> * Someone else tested this code for me on a Solaris system, and also got
>   the expected result.
>
> * On Linux, by contrast, we see behavior such as the following:
>
>   $ ./multi_writer 10 10000 1000 > g
>   Expected file size: 100000000
>   Actual file size:    16323000
>   Difference:          83677000
>   $ ls -l g
>   -rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g
>
> Summary of the above output: some children are overwriting the output
> of other children because output is not atomic with respect to updates
> to the file offset.
>
> For reference, POSIX.1-2008/SUSv4 Section XSI 2.9.7 says:
>
> [[
> 2.9.7 Thread Interactions with Regular File Operations
>
> All of the following functions shall be atomic with respect to each other
> in the effects specified in POSIX.1-2008 when they operate on regular
> files or symbolic links:
>
>     chmod()
>     ...
>     pread()
>     read()
>     ...
>     readv()
>     pwrite()
>     ...
>     write()
>     writev()
>
> If two threads each call one of these functions, each call shall either
> see all of the specified effects of the other call, or none of them.
> ]]
>
> (POSIX.1-2001 has similar text.)
>
> This text is in one of the Threads sections, but it applies equally
> to threads in different processes as to threads in the same process.
>
> I've tested the code below on ext4, XFS, and BtrFS, on kernel 3.12 and a
> number of other recent kernels, all with similar results, which suggests
> the result is in the VFS layer. (Can it really be so simple as no locking
> around pieces such as
>
>         loff_t pos = file_pos_read(f.file);
>         ret = vfs_write(f.file, buf, count, &pos);
>         if (ret >= 0)
>                 file_pos_write(f.file, pos);
>
> in fs/read_write.c?)
>
> I discovered this behavior after Yongzhi Pan reported some unexpected
> behavior in some of my code that forked to create a parent and
> child that wrote to the same file. In some cases, expected output
> was not appearing. In other words, after a fork(), and in the absence
> of any other synchronization technique, a parent and a child cannot
> safely write to the same file descriptor without risking overwriting
> each other's output. But POSIX requires this, and other systems seem
> to guarantee it.
>
> Am I correct to think there's a kernel problem here?
>
> Thanks,
>
> Michael
>
> ===
>
> /* multi_writer.c
> */
>
> #include <sys/wait.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <string.h>
> #include <errno.h>
>
> typedef enum { FALSE, TRUE } Boolean;
>
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
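
The quoted message notes that, absent some other synchronization
technique, processes sharing one open file description cannot safely
write through it on the affected kernels. A minimal user-space sketch of
one such technique follows. It is not the multi_writer.c from the thread
(that listing is cut off above) and not a proposed kernel fix. The helper
names whole_file_lock(), locked_write() and run_writers() are made up for
illustration, and the choice of an fcntl() record lock is an assumption,
not something suggested in the thread.

```c
/* Sketch only: forked writers sharing one file offset, serialized with a
 * POSIX record lock. All helper names here are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take (F_WRLCK) or drop (F_UNLCK) a lock covering the whole file.
 * Record locks are owned per process, so forked children exclude each
 * other even though they share the same open file description. */
static int whole_file_lock(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    return fcntl(fd, F_SETLKW, &fl);    /* blocks until granted */
}

/* Write len bytes at the shared offset while holding the lock, so the
 * offset read, the I/O and the offset update happen as one unit with
 * respect to other callers of this function. */
static int locked_write(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    int rc = 0;

    if (whole_file_lock(fd, F_WRLCK) == -1)
        return -1;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            rc = -1;
            break;
        }
        done += (size_t)n;
    }
    if (whole_file_lock(fd, F_UNLCK) == -1)
        rc = -1;
    return rc;
}

/* Fork num_children writers on fd; each writes num_blocks blocks of
 * block_size bytes. Returns the final file size, or -1 on failure. */
static off_t run_writers(int fd, int num_children, int num_blocks,
                         size_t block_size)
{
    struct stat sb;
    int failed = 0;
    char *buf;

    assert(block_size > 0);
    buf = malloc(block_size);
    if (buf == NULL)
        return -1;
    memset(buf, 'x', block_size);

    for (int c = 0; c < num_children; c++) {
        pid_t pid = fork();
        if (pid == -1) {
            free(buf);
            return -1;
        }
        if (pid == 0) {         /* child: write, then leave via _exit() */
            for (int b = 0; b < num_blocks; b++)
                if (locked_write(fd, buf, block_size) == -1)
                    _exit(EXIT_FAILURE);
            _exit(EXIT_SUCCESS);
        }
    }
    free(buf);

    for (int c = 0; c < num_children; c++) {
        int status;
        if (wait(&status) == -1 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            failed = 1;
    }
    if (failed || fstat(fd, &sb) == -1)
        return -1;
    return sb.st_size;
}
```

Calling run_writers(STDOUT_FILENO, 10, 10000, 1000) with stdout
redirected to a fresh file would mirror the invocation shown in the
quoted message. flock() is deliberately not used: its locks belong to
the open file description, which the forked children share, so it would
not make them exclude one another.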