hi Bob and others
I found the following GFS1/GFS2 design document on the Red Hat 108 Developer Portal; among other things, it details some of the issues with NFS on GFS:
https://rpeterso.108.redhat.com/servlets/ProjectDocumentView?documentID=99
(I see it was sent to this list over a year ago, but I never found it while searching through the archives. It has a lot of good information in it.)
It carries a disclaimer: "Some of the comments are no longer applicable due to design changes."
My question to you, or to anyone familiar with NFS on GFS or with GFS in general, is: which of the following are still valid issues for the current (6.1u4) version of GFS? If all or most of them still apply, I can use this as motivation for my customer to strongly consider moving off NFS on GFS. Removing NFS from our GFS cluster has been on the cards for quite a while, but it has not gained momentum due to a lack of information on the performance gains of such a move (very difficult to gauge) and on the architectural problems/limitations of NFS on GFS (for which the following extract is spot-on).
Note - can you consider adding a link to this document from your FAQ?
+++++++++
o NFS Support

A GFS filesystem can be exported through NFS to other nodes. There are a number of issues with NFS on top of a cluster filesystem, though.
1) Filehandle misses

When an NFS request comes into the server, it's up to the filesystem (and a few Linux helper routines) to map the NFS filehandle to the correct inode. Doing that is easy if the inode is already in the node's cache. The tricky part is when the filesystem must read the inode in from disk. There is nothing in the filehandle that anchors the inode into the filesystem (such as a glock on a directory that contains an entry pointing to the inode), so a lot more care has to be taken to make sure the block really contains a valid inode. (See the section on the proposed new RG formats.)

It's also non-trivial to handle inode migration in GFS when an NFS server is running. There is no centralized data structure that can map filehandles to inodes (such a structure would be a scalability/performance bottleneck), and it's difficult to find a representation of the inode that could be used to quickly find it even in the face of the inode changing blocks.

Another problem is that filehandle requests can come in at random times for inodes that don't exist anymore or are in the process of being recreated. This can break optimizations based on ideas like "since this node is in the process of creating this inode, it is the only one that knows about its locks". GFS has suffered from these mis-optimizations in the past. From what I've seen, I believe OCFS2 currently has problems like this, too.
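
To make issue 1 a bit more concrete for our own discussion: below is a rough user-space sketch of the kind of filehandle-to-inode validation being described. It is my own simplification, not GFS code; the structures and names (fake_fh, fake_dinode, fh_to_inode, the magic value) are made up for illustration. The point is that the handle carries little more than a block number and a generation, so on a cache miss the server has to read the block and prove it still holds a valid, matching dinode before trusting it.

/*
 * Simplified, user-space illustration of mapping an NFS-style
 * filehandle to an inode.  Nothing anchors the inode, so the block
 * has to be read and validated every time it is not in cache.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DINODE_MAGIC 0x19661030u    /* made-up magic value for the sketch */

struct fake_fh {                    /* what the filehandle might carry */
    uint64_t block;                 /* disk block supposedly holding the dinode */
    uint32_t generation;            /* bumped every time the inode is recreated */
};

struct fake_dinode {                /* on-disk inode header, much simplified */
    uint32_t magic;
    uint32_t generation;
    uint64_t size;
};

/* Pretend to read a block from disk; here it just copies from an array. */
static int read_block(const struct fake_dinode *disk, size_t nblocks,
                      uint64_t block, struct fake_dinode *out)
{
    if (block >= nblocks)
        return -1;
    memcpy(out, &disk[block], sizeof(*out));
    return 0;
}

/*
 * Map a filehandle to an inode.  The expensive part is the validation:
 * the block may have been freed and reused since the client got the
 * handle, so every check below is needed before the inode is trusted.
 */
static int fh_to_inode(const struct fake_dinode *disk, size_t nblocks,
                       const struct fake_fh *fh, struct fake_dinode *out)
{
    if (read_block(disk, nblocks, fh->block, out))
        return -1;                          /* block doesn't even exist */
    if (out->magic != DINODE_MAGIC)
        return -1;                          /* block no longer holds an inode */
    if (out->generation != fh->generation)
        return -1;                          /* inode was deleted and recreated */
    return 0;                               /* stale-handle checks passed */
}

int main(void)
{
    struct fake_dinode disk[4] = {
        [2] = { .magic = DINODE_MAGIC, .generation = 7, .size = 4096 },
    };
    struct fake_fh good  = { .block = 2, .generation = 7 };
    struct fake_fh stale = { .block = 2, .generation = 6 };
    struct fake_dinode ino;

    printf("good handle:  %s\n", fh_to_inode(disk, 4, &good, &ino)  ? "ESTALE" : "ok");
    printf("stale handle: %s\n", fh_to_inode(disk, 4, &stale, &ino) ? "ESTALE" : "ok");
    return 0;
}

On a single-node filesystem the generation check is usually enough; on a cluster filesystem another node can reuse or migrate the block at any time, which, as I read it, is where the extra care described above comes in.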
2) Readdir

Linux has an interesting mechanism to handle readdir() requests. The VFS (or NFSD) passes the filesystem a request containing not only the directory and offset to be read, but a filldir function to call for each entry found. So the filesystem doesn't directly fill in a buffer of entries, but calls an arbitrary routine that can either put the entries into a buffer or do some other type of processing on them. This is a powerful concept, but it can be easily misused.

I believe that NFSD's use of it is problematic at best. The filldir routine used by NFSD for the readdirplus NFS procedure calls back into the filesystem to do a lookup and stat() on the inode pointed to by the entry. This call is painful because of GFS' locking. gfs_readdir() must call filldir with the directory lock held so that it doesn't lose its place in the directory. The stat() that the filldir routine does causes the inode's lock to be acquired. Because concurrent inode locks must always be acquired in ascending numerical order and the filldir routine forces an ordering that might be something other than that, there is a deadlock potential.

GFS detects when NFSD calls its readdir and switches to a routine that avoids calling the filldir routine with the lock held. It's not as efficient, but it avoids the deadlock. It'd be nice if there was a better way to do the detection, though. (The code currently looks at the program's name.)
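
For anyone not familiar with the filldir mechanism, here is a stripped-down user-space illustration. The names (filldir_t, fake_readdir, fill_buffer) are my own stand-ins, not the real VFS interfaces. The point is that the per-entry callback runs while the directory lock is still held, so anything the callback locks in turn is acquired out of order relative to the ascending-number rule described above.

/*
 * Stripped-down model of the filldir callback pattern: readdir does not
 * fill a buffer itself, it calls the caller-supplied callback once per
 * entry while still holding the directory lock.
 */
#include <stdio.h>
#include <pthread.h>

typedef int (*filldir_t)(void *ctx, const char *name, unsigned long ino);

struct dirent_entry { const char *name; unsigned long ino; };

static pthread_mutex_t dir_lock = PTHREAD_MUTEX_INITIALIZER;

/* Walk the directory and hand each entry to filldir with dir_lock held. */
static int fake_readdir(const struct dirent_entry *ents, int n,
                        filldir_t filldir, void *ctx)
{
    int i, err = 0;

    pthread_mutex_lock(&dir_lock);
    for (i = 0; i < n && !err; i++)
        err = filldir(ctx, ents[i].name, ents[i].ino);
    pthread_mutex_unlock(&dir_lock);
    return err;
}

/* A plain readdir-style callback: just records the entry.  Harmless. */
static int fill_buffer(void *ctx, const char *name, unsigned long ino)
{
    (void)ctx;
    printf("entry %lu: %s\n", ino, name);
    return 0;
}

/*
 * A readdirplus-style callback would additionally stat() each entry
 * right here, i.e. take that inode's lock while the directory lock is
 * still held.  If the cluster's lock order is "ascending inode number"
 * and an entry's inode number is lower than the directory's, that is
 * exactly the ordering inversion with deadlock potential.
 */

int main(void)
{
    struct dirent_entry ents[] = { { "a", 12 }, { "b", 5 }, { "c", 42 } };
    return fake_readdir(ents, 3, fill_buffer, NULL);
}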
3) FCNTL locking

There are a huge number of issues with acquiring and failing over fcntl()-style locks when there are multiple GFS heads exporting NFS. GFS pretty much ignores them right now. A good place to start would be to change NFSD so it actually passes fcntl calls down into the filesystem.
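
For reference, this is the kind of lock being talked about: a plain POSIX fcntl() byte-range lock, shown here on a local file. The open()/fcntl(F_SETLKW) calls below are the standard API; the problem described above is that, as I understand it, with several GFS heads exporting the same filesystem over NFS nothing passes these locks down into GFS, so clients talking to different heads don't see each other's locks.

/* Minimal local example of an fcntl()-style (POSIX) byte-range lock. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/lock-demo", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock to end of file */
    };

    /* Take the lock (blocking), pretend to work, then release it. */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        perror("fcntl(F_SETLKW)");
        return 1;
    }
    printf("got write lock on /tmp/lock-demo\n");

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}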
4) NFSv4

NFSv4 requires all sorts of changes to GFS in order for them to work together. Op locks being one I can remember at the moment. I think I've repressed my memories of the others.
++++++++
Riaan van Niekerk
Systems Architect
Obsidian Systems / Obsidian Red Hat Consulting
riaan@xxxxxxxxxxxxxx
tel: +27 11 792 6500 | fax: +27 11 792 6522 | cell: +27 82 921 8768
http://www.obsidian.co.za