Re: Random I/O over NFS has horrible performance due to small I/O transfers


 



[BTW: every time I reply to you, the e-mail to your address bounces. I assume you are able to see my replies through the two reflectors that are cc'd].

On Jan 29, 2010, at 11:57 AM, Quentin Barnes wrote:
If the flush call has no data to flush, the GETATTR is pointless.
The file's cached attribute information is already as valid as can
ever be hoped for with NFS and its life is limited by the normal
attribute timeout.  On the other hand, if the flush has cached data
to write out, I would expect the WRITE reply to return the updated
attributes in its post-op attribute status, again making the
GETATTR on close pointless.  Can you explain what I'm not following?

The Linux NFS client does not use the post-op attributes to update the
cached attributes for the file.  Because there is no guarantee that
the WRITE replies will return in the same order the WRITEs were sent,
it's simply not reliable. If the last reply to be received was for an older write, then the mtime and ctime (and possibly the size) would be stale, and would trigger a false data cache invalidation the next time
a full inode validation is done.
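
To make the hazard concrete, here's a minimal sketch (purely
illustrative, not the actual fs/nfs code) of what goes wrong if the
client blindly applies post-op attributes from WRITE replies:

/*
 * Illustrative only.  Two WRITEs are in flight; the reply for the
 * newer WRITE happens to be processed first, then the reply for the
 * older WRITE arrives and overwrites the cached attributes.
 */
struct cached_attrs {
	long long mtime;	/* mtime as last seen by the client */
	long long size;		/* file size as last seen by the client */
};

static void naive_post_op_update(struct cached_attrs *c,
				 long long reply_mtime, long long reply_size)
{
	/*
	 * If this reply belongs to an older WRITE, the cached mtime and
	 * size roll backwards here.  The next full inode validation then
	 * sees a "changed" mtime and invalidates the data cache for
	 * nothing.
	 */
	c->mtime = reply_mtime;
	c->size = reply_size;
}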

Ah, yes, the out of order WRITE replies problem.  I knew I was
forgetting something.

This may be a stupid question, but why not use the post-op attribute
information to update the inode whenever the fattr mtime exceeds the
inode mtime, and simply ignore the post-op update in all other cases,
since that would indicate an out-of-order arrival?
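
To make the idea concrete, I mean something along these lines (just a
sketch of the comparison, not actual kernel code):

/*
 * Sketch only: accept post-op attributes when the reply's mtime is
 * newer than what is cached; otherwise assume the reply belongs to an
 * older WRITE that arrived late and drop it.
 */
static void post_op_update_if_newer(long long *cached_mtime,
				    long long *cached_size,
				    long long reply_mtime,
				    long long reply_size)
{
	if (reply_mtime > *cached_mtime) {
		*cached_mtime = reply_mtime;
		*cached_size = reply_size;
	} else {
		/* older or equal mtime: treat it as out-of-order and skip */
	}
}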

I seem to recall that older versions of our client used to do that, and we may still in certain cases. Take a look at the post-op attribute handling near nfs_update_inode() in fs/nfs/inode.c.

One problem is that WRITE replies are entirely asynchronous with application writes, and are handled in a different kernel context (soft IRQ? I can't remember). Serializing updates to the attribute cache between different contexts is difficult. The solution used today means that attributes are updated only in synchronous contexts, so we can get a handle on the many race conditions without causing deadlocks.

For instance, post-op attributes can indicate that the client has to invalidate the page cache for a file. That's tricky to do correctly in a context that can't sleep, since invalidating a page needs to take the page lock. Setting NFS_INO_INVALID_ATTR is one way to preserve that indication until the client is running in a context where a data cache invalidation is safe to do.
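
In rough terms the pattern looks like this (a sketch of the idea only,
not the actual fs/nfs code; the real client uses atomic bit operations
and proper locking around the inode):

#define SKETCH_INO_INVALID_ATTR	0x1	/* stand-in for NFS_INO_INVALID_ATTR */

struct sketch_nfs_inode {
	unsigned long cache_validity;
};

/* Called while handling an RPC reply -- this context must not sleep. */
static void sketch_note_attr_change(struct sketch_nfs_inode *nfsi)
{
	/* Only record that the attributes are suspect; don't invalidate. */
	nfsi->cache_validity |= SKETCH_INO_INVALID_ATTR;
}

/* Called later from process context, where sleeping is allowed. */
static void sketch_revalidate(struct sketch_nfs_inode *nfsi)
{
	if (nfsi->cache_validity & SKETCH_INO_INVALID_ATTR) {
		nfsi->cache_validity &= ~SKETCH_INO_INVALID_ATTR;
		/* It is safe here to take page locks and invalidate
		 * cached data. */
	}
}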

(Of course
other fields would also need to be checked to see whether the file's
state changed in other ways warranting a general invalidation.)
I would assume there are out-of-order arrivals for other op replies
that prevent such a trivial algorithm?

This out-of-order post-op attribute data invalidating the cache sounds
like a well-known problem that people have either been trying to solve
for a long time or have proved can't be solved.  If there's a white
paper you can point me to that discusses the problem at length, I'd
like to read it.

I don't know of one.

However, I tore into the code to better understand what was triggering
the GETATTR on close.  A close(2) does two things: a flush
(nfs_file_flush) and a release (nfs_file_release).  I had thought the
GETATTR was happening as part of nfs_file_flush().  It's not.  The
GETATTR is triggered by nfs_file_release(): as part of doing
put_nfs_open_context() -> nfs_close_context() -> nfs_revalidate_inode(),
it ends up calling NFS_PROTO(inode)->getattr()!
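
Spelled out, the path I traced looks roughly like this (paraphrased
from my reading of the source, not a verbatim copy):

/*
 * close(2)
 *   -> nfs_file_flush()                    writes back any dirty data
 *   -> nfs_file_release()
 *        -> put_nfs_open_context()
 *             -> nfs_close_context()
 *                  -> nfs_revalidate_inode()
 *                       -> NFS_PROTO(inode)->getattr()   <-- the GETATTR
 */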

I'm not sure, but I suspect that O_DIRECT files take the
put_nfs_open_context() path that results in the extraneous GETATTR
on close(2) because filp->private_data is non-NULL, whereas for regular
files it's NULL.  Is that right?  If so, can this problem be easily
fixed?

I'm testing a patch to use an asynchronous close for O_DIRECT files.
This will skip the GETATTR for NFSv2/v3 O_DIRECT files, and avoid
waiting for the CLOSE for NFSv4 O_DIRECT files.

When you're ready for external testing, will you be publishing it here
on the NFS mailing list?  Any guess when it might be ready?

I have a pair of patches in my kernel git repo at git.linux-nfs.org (cel). One fixes close, the other attempts to address open. I'm still working on the open part. I'm hoping to get these into 2.6.34. I'm sure these are not working quite right yet, but you might want to review the work, as it probably looks very similar to what you've already done internally.

I've also noticed that our client still sends a lot of ACCESS requests in the simple open-write-close use case. Too many ACCESS requests seem to be a perennial problem. I'm going to look at that next.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




