On Wed, Aug 19, 2009 at 06:40:33AM -0600, Cornelius, Martin (DWBI) wrote:
>
> Hi linux-filesystem experts
>
> First, please apologize if this is the wrong place to ask this question
> -- we googled around a lot and couldn't find an answer, that's why we
> finally try it here.
>
> The actual cause of the question is our reasoning about the robustness
> of the openssh code. Every invocation of ssh possibly adds a line to the
> file $(HOME)/.ssh/known_hosts, and (contrary to our expectations) we
> couldn't find any explicit locking in the code. Instead, the ssh code
> just opens the file with O_APPEND, writes to the file, and closes it. We
> already conducted a simple test that tries to create a 'corrupted'
> known_hosts file by starting lots of ssh commands concurrently, but so
> far we could not observe corruption. We now wonder if this is just by
> luck or if a programmer can rely on this behaviour.
>
> The generalized question is: If two (or more) different processes open
> the same file on a !LOCAL! disk with O_APPEND, and then concurrently
> issue write() calls to store data into this file, is there any guarantee
> that the data of each single write() call are written 'atomically', or
> could it happen that the data of different write()s are mangled or one
> write() overwrites data already written ? To prevent misunderstandings,
> we assume that ALL writers have opened the file with O_APPEND, and all
> write calls return normally without being interrupted by a signal.
>

So looking at the code, with O_APPEND set, every time the app calls
write() the position it's writing to is set to the end of the file.  It
looks like most filesystems (with the exception of btrfs) will be holding
the inode->i_mutex when they call generic_write_checks(), which gives the
position to write to.
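A minimal sketch of the kind of test described in the question, assuming
a local POSIX filesystem: several processes open the same file with
O_APPEND and each issue one write() per record, then the file is checked
for mangled or missing records.  The record size, writer count, and
function names are illustrative assumptions, not taken from openssh.  A
clean run is evidence only, not proof of a guarantee.

```python
import os
import tempfile

RECORD_LEN = 64            # hypothetical record size, newline included
WRITERS = 4                # keep <= 10 so each tag is a single digit
RECORDS_PER_WRITER = 200


def append_records(path, tag):
    """Open with O_APPEND and emit each record with a single write()."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    record = (str(tag) * (RECORD_LEN - 1) + "\n").encode()
    try:
        for _ in range(RECORDS_PER_WRITER):
            if os.write(fd, record) != len(record):
                raise OSError("short write")
    finally:
        os.close(fd)


def run_concurrent_appends():
    """Fork WRITERS processes that append to one file; return its lines."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        pids = []
        for tag in range(WRITERS):
            pid = os.fork()
            if pid == 0:
                # Child: write, then exit without running parent cleanup.
                status = 1
                try:
                    append_records(path, tag)
                    status = 0
                finally:
                    os._exit(status)
            pids.append(pid)
        for pid in pids:
            _, status = os.waitpid(pid, 0)
            if status != 0:
                raise RuntimeError("a writer process failed")
        with open(path, "rb") as f:
            return f.read().splitlines()
    finally:
        os.unlink(path)


def count_mangled(lines):
    """Count lines that are not RECORD_LEN - 1 copies of one byte."""
    return sum(
        1 for ln in lines
        if len(ln) != RECORD_LEN - 1 or len(set(ln)) != 1
    )
```

If appends were interleaved or overwritten, count_mangled() would be
non-zero or the number of lines would fall short of
WRITERS * RECORDS_PER_WRITER.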
So determining the position to write to and the subsequent write happen
atomically, so unless the fs is btrfs (which may or may not be a bug,
I'll leave that to the smarter people), O_APPEND writes should appear to
be atomic.

> The Posix standard states that advancing the filepointer to the end of
> the file and the following execution of the write are performed
> atomically with O_APPEND, but as far as we grasp it does not state if
> the actual write is also atomic w.r.t. other concurrent write calls.
>
> If there is some guarantee :
> - does a (perhaps filesystem dependent) limit for this guarantee exist ?
> (like the PIPE_BUF size limit when writing to a pipe), and is there a
> way to detect this limit programmatically ?

Like I said, it seems most filesystems hold the i_mutex when doing the
check, but it appears btrfs does not.  I think it's a bug, but I'm not
sure.  There would not be a way to tell programmatically.

> - does this guarantee also hold, if several threads in one process write
> to a single file DESCRIPTOR concurrently ?

Yes, the position is set every single time write() is called.

> - does this guarantee also hold for remote filesystems (nfs / smb) ?
>

This I'm more likely to be wrong on, but I don't think so.  It would be
atomic on the local machine, but if somebody on another machine is
writing to the same file I think you would probably be screwed.

> If the answer to the last question is 'no' : is there a simple way to
> programmatically detect whether the guarantee holds for a specific file
> ?

I don't think so.  Really, your best bet if you are going to use a remote
fs that can have concurrent writers with no knowledge of each other is
to use fcntl() locking.

Thanks,

Josef
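For reference, a minimal sketch of the fcntl() locking approach suggested
above: take an exclusive advisory lock, append, then unlock.  The helper
name locked_append is hypothetical.  The locks are advisory, so this only
helps if every writer uses the same protocol, and whether locks are
honoured across machines depends on the remote filesystem and how it is
configured.

```python
import fcntl
import os


def locked_append(path, data):
    """Append data while holding an exclusive advisory lock on the file."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # Blocks until the whole-file lock is granted.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            view = memoryview(data)
            while view:
                # Loop in case write() returns a short count.
                written = os.write(fd, view)
                view = view[written:]
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would use it as locked_append(path, b"one record\n") wherever it
would otherwise call write() directly.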