CC-ing linux-fsdevel, because this issue might be interesting to other
filesystems which allow NFS exporting and do page cache invalidation.

Brian Wang wrote:
>> Thanks for the quick fix. But I may have a hard one for you.
>>
>> 1. big_writes definitely works now. It also fixed the performance
>> problem I reported. I think it is related to the 4k reads the patch
>> fixed.
>>
>> 2. The problem is definitely NFS related. If you write some big files
>> via NFS and read them back right away, it works. Then, if you leave
>> them alone for a few hours and try to read them again, you will get an
>> Input/Output error. I used the "-o big_writes,noforget" options.
>
> More info on this.
>
> Even a read over NFS returns an I/O error, while a read from the local
> fuse mount works fine. After waiting for a few hours (when you get I/O
> errors reading the files you wrote before), if you write a new file and
> then try to read it back, it takes a lot of CPU and never finishes. It
> looks like it sits in a dead loop.

OK, I found the reason for the I/O errors and slowdowns.

Short story: try the 'kernel_cache' option; it fixed both issues for me.

Long story:

NFSv2/3 don't have the concept of an open file, so for each read, nfsd
basically does:

  open file
  read from file
  close file

When opening the file, fuse will flush the pages associated with the
inode, unless the 'kernel_cache' option is used.

This in itself shouldn't be a problem, since the invalidated pages will
just be read again. The problem comes from the way nfsd does the
reading: it uses splice to reference pages from the filesystem, instead
of copying data to a temporary buffer. The following can happen:

 - one nfsd thread is doing the read, and is inside the splice code
 - another nfsd thread is starting the read and calls open on the same
   inode

The open will invalidate the current page cache for the inode, which
will result in splice returning a short read count. In an extreme case,
it could return a zero read count.

All this still doesn't result in any errors in most cases, since the
Linux read code is built to first do readahead asynchronously, and to
fall back to single-page synchronous reads only if a page wasn't read
in on the previous readahead pass. So in most cases the short read
count is ignored and the read is retried, but now as a separate 4k read
request for each page. This is the cause of the slowdown.

However, in the rare case that splice returns zero even for the
single-page read, the Linux read logic will take that as a read error
and return -EIO.

While 'kernel_cache' is a good workaround for this issue, it might not
be ideal for all filesystems, because cache invalidation is important
in some cases. So I'm going to think about how to solve this properly.
Probably splice should detect that pages have been invalidated, and
retry the operation.

Miklos
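
A minimal sketch of the per-file variant of the 'kernel_cache' workaround
described above, assuming a libfuse 2.x high-level passthrough filesystem
(the xmp_* handlers and the passthrough backend below are illustrative,
not taken from the report): setting keep_cache in the open handler tells
the kernel to keep cached pages for that inode across opens, so nfsd's
open/read/close cycle does not invalidate pages while splice is still
referencing them.

#define FUSE_USE_VERSION 26

#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Pass requests straight through to the underlying filesystem; only
 * getattr/open/read/release are implemented to keep the sketch short. */

static int xmp_getattr(const char *path, struct stat *stbuf)
{
	if (lstat(path, stbuf) == -1)
		return -errno;
	return 0;
}

static int xmp_open(const char *path, struct fuse_file_info *fi)
{
	int fd = open(path, fi->flags);
	if (fd == -1)
		return -errno;

	fi->fh = fd;

	/* Per-file equivalent of '-o kernel_cache': keep the cached pages
	 * for this inode across open() calls, so an nfsd open/read/close
	 * cycle does not invalidate pages that splice is referencing. */
	fi->keep_cache = 1;

	return 0;
}

static int xmp_read(const char *path, char *buf, size_t size, off_t offset,
		    struct fuse_file_info *fi)
{
	ssize_t res;

	(void) path;
	res = pread(fi->fh, buf, size, offset);
	if (res == -1)
		return -errno;
	return res;
}

static int xmp_release(const char *path, struct fuse_file_info *fi)
{
	(void) path;
	close(fi->fh);
	return 0;
}

static struct fuse_operations xmp_oper = {
	.getattr = xmp_getattr,
	.open    = xmp_open,
	.read    = xmp_read,
	.release = xmp_release,
};

int main(int argc, char *argv[])
{
	return fuse_main(argc, argv, &xmp_oper, NULL);
}

Against libfuse 2.x this typically builds with
'gcc -Wall sketch.c `pkg-config fuse --cflags --libs` -o sketch'.
Mounting with '-o kernel_cache' has the same effect globally without any
code change; a filesystem that genuinely needs invalidation on open can
instead set keep_cache only for files whose cached data is known to be
still valid.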