While off reading code, I noticed something that didn't look quite right...

Look at generic_file_direct_write(), in mm/filemap.c. What the code there is doing is:

 - dropping the range we're writing to from the page cache (writing it first if necessary), then
 - doing the write, then
 - invalidating that range in the pagecache again.

The second invalidate_inode_pages2() call is because we aren't holding any locks that prevent _reads_ from pulling data into the pagecache, and we need to make sure the pagecache is consistent with what's on disk - hence, we need to invalidate the range we wrote _after_ our write completes, when new reads will see the correct data.

But - if it's an async write, the write hasn't actually completed when we return from submission, so the second invalidate_inode_pages2() call is pointless (and it's not happening anywhere else, either).

So - perhaps you're going to say "Why not just add a call to invalidate_inode_pages2() in the correct spot, in the async completion path?"

Well, invalidate_inode_pages2() drops dirty pages on the floor - you need to call filemap_write_and_wait_range() first, and be holding something that prevents pages from being redirtied - i_mutex. But we only hold i_mutex for IO submission, it isn't held for the entire duration of the IO (that would suck).

This is particularly bad since O_DIRECT granularity is the fs block size - i.e. potentially smaller than PAGE_SIZE. A buffered write that happens while the AIO write is in flight could be going to the same page but different blocks - but one of the two writes is going to end up dropped/overwritten.

And another scary thing is the fact that O_DIRECT writes fall back to buffered at the drop of a hat - writes to holes are the main case, but it can be anything the filesystem didn't feel like implementing in the O_DIRECT case (I vaguely recall there being some btrfs case involving some weird extent situation where it would fall back).
So a user doesn't have to be intentionally doing O_DIRECT and buffered IO at the same time.

Yet _another_ fun fact: I mentioned that for the filemap_write_and_wait_range(); invalidate_inode_pages2() sequence to work we have to be preventing pages from being redirtied. Well, i_mutex does the job for buffered writes, but not page faults - AFAICT page_mkwrite() would have to take i_mutex for this code to not race with page faults, and the default page_mkwrite implementation (filemap_page_mkwrite()) definitely does not.

It does _lock_ the page though, so if we had something that combined filemap_write_and_wait_range() with invalidating pages, making sure to have the page still locked when removing it from the page cache - that ought to work.

XFS does seem to attempt to get this right - its .page_mkwrite takes the inode's XFS_MMAPLOCK_SHARED lock, and the xfs truncate and fallocate code both take XFS_MMAPLOCK_EXCL (truncate and, in particular, fcollapse also need to drop ranges from the page cache; fcollapse is where I first noticed this particular issue).

But AFAICT xfs's dio path does _not_ take the correct lock for this to work - although if you look at xfs_file_dio_aio_write() they were clearly thinking about page cache synchronization, so perhaps I'm missing something about how xfs's locking works.

BTW, I'd love it if someone could just show me that I was mistaken about some of this and it does in fact work. A lot of this locking is damned subtle and it's entirely possible I'm missing something crucial...
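To make the async-write ordering and the sub-page granularity problem concrete, here is a minimal userspace toy model in C. It is only a sketch and not kernel code: every toy_* name is hypothetical, the "file" is a single page made of two blocks, writeback is assumed to be page-granular, and toy_invalidate() drops dirty data because that is how invalidate_inode_pages2() is described above. It models the sequences as deterministic interleavings rather than real concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 4                      /* toy fs block size */
#define TOY_PAGE_SIZE  (2 * TOY_BLOCK_SIZE)   /* one page = two blocks */

/* One page of one file: its on-disk bytes plus an optional cached copy. */
struct toy_file {
    char disk[TOY_PAGE_SIZE];
    char cache[TOY_PAGE_SIZE];
    bool cached;
    bool dirty;
};

void toy_init(struct toy_file *f)
{
    memset(f, 0, sizeof(*f));
    memset(f->disk, 'o', TOY_PAGE_SIZE);      /* 'o' = old data */
}

/* Buffered read: fill the cache from disk on a miss, return first byte. */
char toy_buffered_read(struct toy_file *f, int block)
{
    assert(block >= 0 && block < 2);
    if (!f->cached) {
        memcpy(f->cache, f->disk, TOY_PAGE_SIZE);
        f->cached = true;
    }
    return f->cache[block * TOY_BLOCK_SIZE];
}

/* Buffered write: dirties the whole page even though one block changes. */
void toy_buffered_write(struct toy_file *f, int block, char c)
{
    (void)toy_buffered_read(f, block);        /* read-modify-write */
    memset(f->cache + block * TOY_BLOCK_SIZE, c, TOY_BLOCK_SIZE);
    f->dirty = true;
}

/* Stand-in for filemap_write_and_wait_range(): page-granular writeback. */
void toy_writeback(struct toy_file *f)
{
    if (f->cached && f->dirty)
        memcpy(f->disk, f->cache, TOY_PAGE_SIZE);
    f->dirty = false;
}

/* Stand-in for invalidate_inode_pages2() as described in the post:
 * the cached page is dropped, dirty or not. */
void toy_invalidate(struct toy_file *f)
{
    f->cached = false;
    f->dirty = false;
}

/* Submission side of a direct write; for AIO both invalidations run here. */
void toy_dio_submit(struct toy_file *f)
{
    toy_writeback(f);
    toy_invalidate(f);
    /* ...IO is queued but has not finished... */
    toy_invalidate(f);
}

/* Completion side: the data reaches the disk, block-granular. */
void toy_dio_complete(struct toy_file *f, int block, char c)
{
    memset(f->disk + block * TOY_BLOCK_SIZE, c, TOY_BLOCK_SIZE);
}

/* Case 1: a read during the in-flight write repopulates the cache.
 * Returns true if a later buffered read disagrees with the disk. */
bool toy_stale_read(void)
{
    struct toy_file f;
    toy_init(&f);
    toy_dio_submit(&f);
    (void)toy_buffered_read(&f, 0);           /* caches old data */
    toy_dio_complete(&f, 0, 'N');
    return toy_buffered_read(&f, 0) != f.disk[0];
}

/* Case 2: a buffered write to the other block races the AIO write.
 * Returns true if both writes survive on disk. */
bool toy_both_writes_survive(bool writeback_before_invalidate)
{
    struct toy_file f;
    toy_init(&f);
    toy_dio_submit(&f);
    toy_buffered_write(&f, 1, 'B');           /* block 1, same page */
    toy_dio_complete(&f, 0, 'N');             /* block 0 */
    if (writeback_before_invalidate)
        toy_writeback(&f);                    /* stale page overwrites block 0 */
    toy_invalidate(&f);                       /* otherwise block 1 is dropped */
    return f.disk[0] == 'N' && f.disk[TOY_BLOCK_SIZE] == 'B';
}
```

In this model toy_stale_read() reports a stale page cache, and toy_both_writes_survive() fails whether or not the completion path writes back before invalidating: without writeback the buffered block is dropped, with writeback the stale cached copy of the other block overwrites the direct write. Real filesystems may track dirtiness per block, so treat this as an illustration of the argument and not a statement about any particular fs.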