On 11/09/2010 04:41 PM, Andrew Morton wrote:
yup. It's a userspace bug, really. Although that bug might be expressed as "userspace didn't know about linux-specific EIO behaviour".
Looking at this some more, I am not convinced this is a userspace bug. First, let me describe the problem scenario: 1) process A calls write 2) process B calls write 3) process A calls fsync, runs into an IO error, returns -EIO 4) process B calls fsync, returns success (even though data could have been lost!) Common sense, as well as these snippets from the fsync man page, suggest that this behaviour is incorrect: DESCRIPTION fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device ... RETURN VALUE On success, these system calls return zero. On error, -1 is returned, and errno is set appropriately. Johannes had the idea of storing an IO error sequence idea in the file descriptor and the address_space (or inode). That way when an IO error happens, we can increment the IO error sequence counter (mapping->io_error_seq). When doing an fsync or msync, we can see whether the inode's IO error sequence has been incremented since the last time we looked. If it has, we set our own counter to the same number and return -EIO to userspace. The problem here is that the same file descriptor may be shared between multiple processes. For example, on fork the child process inherits file descriptors from the parent. If the child reads 4kB, the file position indicator (f_pos) will be changed for the parent (and siblings), too. This is an odd quirk of Unix semantics, maybe inherited from vfork (I suspect it would have been easier for everybody if the child got a _copy_ of each file descriptor, instead of sharing them with parent and siblings), but chances are there is some program out there by now that relies on this behaviour. Given that the file descriptors are shared, we will run into the same race above, when process A and B are siblings or parent & child. That leaves the solution of putting the IO error sequence number in its own little array pointed to by struct files_struct. Not only will that complicate expand_files, it will also require that we tell vfs_sync and similar functions either the file descriptor number or a pointer to the io error sequence counter, because the "struct file" does not tell us the file descriptor number... I'm not sure anyone will like the complexity of the above solution (hi Al). There is another "solution" to consider - redirtying pages when write IO fails. This has the issue of potentially filling up system memory with uncleanable dirty pages and deadlocking the whole system. There does not seem to be a nice solution to this problem, but it does look like a problem we may want to fix. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html