CC-ing linux-fsdevel, because this issue might be interesting to other
filesystems which allow NFS exporting and do page cache invalidation.

Brian Wang wrote:
>> Thanks for the quick fix. But I may have a hard one for you.
>>
>> 1. big_writes definitely works now. It also fixed the performance
>> problem I reported. I think it is related to the 4k reads the patch
>> fixed.
>>
>> 2. The problem is definitely NFS related. If you write some big files
>> via NFS and read them back right away, it works. Then, if you leave
>> them alone for a few hours and try to read them again, you will get an
>> Input/Output error. I used the "-o big_writes,noforget" options.
>
> More info on this.
>
> Even a read over NFS returns an I/O error, while a read from the local
> fuse mount works fine. After waiting for a few hours (when you get I/O
> errors reading the files you wrote before), if you write a new file and
> then try to read it back, it takes a lot of CPU and never finishes. It
> looks like it sits in a dead loop.

OK, I found the reason for the I/O errors and slowdowns.

Short story: try the 'kernel_cache' option; it fixed both issues for me.

Long story:

NFSv2/3 don't have the concept of an open file, so for each read, nfsd
basically does:

  open file
  read from file
  close file

When opening the file, fuse will flush the pages associated with the
inode, unless the 'kernel_cache' option is used.

This in itself shouldn't be a problem, since the invalidated pages will
just be read again. The problem comes from the way nfsd does the
reading: it uses splice to reference pages from the filesystem, instead
of copying data to a temporary buffer. The following can happen:

 - one nfsd thread is doing the read, and is inside the splice code
 - another nfsd thread is starting the read and calls open on the same
   inode

The open will invalidate the current page cache for the inode, which
will result in splice returning a short read count. In an extreme case,
it could return a zero read count.

All this still doesn't result in any errors in most cases, since the
Linux read code is built to first do readahead asynchronously, and to
fall back to single-page synchronous reads only if a page wasn't read
in on the previous readahead pass. So in most cases the short read
count is ignored and the read is retried, but now as a separate 4k read
request for each page. This is the cause of the slowdown.

However, in the rare case that splice returns zero even for the
single-page read, the Linux read logic will take that as a read error
and return -EIO.

While 'kernel_cache' is a good workaround for this issue, it might not
be ideal for all filesystems, because cache invalidation is important
in some cases. So I'm going to think about how to solve this properly.
Probably splice should detect that pages have been invalidated, and
retry the operation.

Miklos
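
A minimal sketch of the per-file variant of the 'kernel_cache' workaround
described above, assuming a libfuse 2.x high-level passthrough filesystem
(the xmp_* handlers and the passthrough backend below are illustrative,
not taken from the report): setting keep_cache in the open handler tells
the kernel to keep cached pages for that inode across opens, so nfsd's
open/read/close cycle does not invalidate pages while splice is still
referencing them.

#define FUSE_USE_VERSION 26

#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Pass requests straight through to the underlying filesystem; only
 * getattr/open/read/release are implemented to keep the sketch short. */

static int xmp_getattr(const char *path, struct stat *stbuf)
{
	if (lstat(path, stbuf) == -1)
		return -errno;
	return 0;
}

static int xmp_open(const char *path, struct fuse_file_info *fi)
{
	int fd = open(path, fi->flags);
	if (fd == -1)
		return -errno;

	fi->fh = fd;

	/* Per-file equivalent of '-o kernel_cache': keep the cached pages
	 * for this inode across open() calls, so an nfsd open/read/close
	 * cycle does not invalidate pages that splice is referencing. */
	fi->keep_cache = 1;

	return 0;
}

static int xmp_read(const char *path, char *buf, size_t size, off_t offset,
		    struct fuse_file_info *fi)
{
	ssize_t res;

	(void) path;
	res = pread(fi->fh, buf, size, offset);
	if (res == -1)
		return -errno;
	return res;
}

static int xmp_release(const char *path, struct fuse_file_info *fi)
{
	(void) path;
	close(fi->fh);
	return 0;
}

static struct fuse_operations xmp_oper = {
	.getattr = xmp_getattr,
	.open    = xmp_open,
	.read    = xmp_read,
	.release = xmp_release,
};

int main(int argc, char *argv[])
{
	return fuse_main(argc, argv, &xmp_oper, NULL);
}

Against libfuse 2.x this typically builds with
'gcc -Wall sketch.c `pkg-config fuse --cflags --libs` -o sketch'.
Mounting with '-o kernel_cache' has the same effect globally without any
code change; a filesystem that genuinely needs invalidation on open can
instead set keep_cache only for files whose cached data is known to be
still valid.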