Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 10 Sep 2009 11:01:10 -0400

On Sep 9, 2009, at 7:15 PM, Steve Dickson wrote:
On 09/09/2009 06:18 PM, Chuck Lever wrote:
On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
The old statd still exists in nfs-utils.  The new statd is an entirely
separate component.  Distributions can continue to use the old statd
as long as they want.  This is a red herring.

Bullshit. If they are adding IPv6 support, then they will have to
upgrade at some point.

I don't see a problem with a distribution upgrade using old statd and a
fresh install using new statd.  You have to install a lot of new
components to get NFS/IPv6 support.
What new components are needed that are not already being installed?

You need a kernel that can do NFS/IPv6, you need to install rpcbind  
and libtirpc, you need the new mount command, you need all the user  
space network pieces to manage IPv6, you need to consider firewall and  
address distribution on your local network, and you need statd and  
mountd/exportfs to get NFS/IPv6 support.

Configuring a system for IPv6 support can also be nontrivial, and not  
something people will do on a whim.

I didn't mean to imply that some of these components are not already
installed.  My point is that the required changes for NFS/IPv6 are
widespread, and that most people would opt for installing a new OS on
their systems to get these features, rather than upgrading all of these
items piecemeal.

And you have never clearly answered why it wouldn't be enough to add a
little code to convert the current on-disk format to sqlite3 when
upgrading to the new statd, if upgradability is truly an important
requirement.  Possibly this is because it eliminates the only real
technical objection you have to using sqlite3 here.
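
For illustration only, a one-time conversion of that kind could look
roughly like the Python sketch below.  It is not code from the patch
series.  It assumes one file per monitored host, named after the host,
and treats the file contents as an opaque blob; the function name, the
table name "hosts", and its columns are all made up for the example.

```python
import os
import sqlite3

def import_flat_records(sm_dir, db):
    """Copy every regular file in sm_dir into a (hypothetical) hosts table.

    Returns the number of files examined.  Re-running is harmless because
    hosts that are already present are skipped.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS hosts ("
        " hostname TEXT PRIMARY KEY,"
        " record   BLOB NOT NULL)"
    )
    seen = 0
    with db:  # one transaction for the whole import
        for name in sorted(os.listdir(sm_dir)):
            path = os.path.join(sm_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                db.execute(
                    "INSERT OR IGNORE INTO hosts VALUES (?, ?)",
                    (name, f.read()),
                )
            seen += 1
    return seen
```

A real converter would also have to parse the old record contents into
separate columns, which this sketch deliberately leaves out.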
The issue I would have with using sqlite3 is that it would add yet another
requirement on nfs-utils... I really don't know how big the sqlite3 and/or
sqlite3-devel (possibly needed for builds) packages are, but it is just
one more thing that will be needed for nfs-utils to function...

sqlite.org provides a single source file version of sqlite3 that is
licensed and designed explicitly for folks to include in their own
code, without the need for linking a library.  You can even disable a
number of build time options to reduce object size.

This means that the libsqlite3 and libsqlite3-devel packages would not  
be required on either the build system or the end system, and it  
eliminates the issue of whether libsqlite3.so can be moved to /lib.

Simplicity is another reason. WTF do we need a full SQL database, when
all we want to do is store 2 pieces of data (a hostname and a cookie)?
It isn't as if this has been a major problem for us previously.

Because we are not storing just a hostname and a cookie.  We are
storing several different data items for each host, and we need to
search over the records, and provide uniqueness constraints, and
handle data conversion (for binary data like the cookie, for string
data like the hostname, and for integers, like the prog/vers/proc
tuple).  We need to store them durably on persistent storage to have
some protection against crashes.  These are all things that an
embedded database can do well, and that we therefore don't have to
code ourselves.

Speaking of red herrings. Why are we adding all this crap?

This is a legacy filesystem! We shouldn't be rewriting NLM/NSM from
scratch, just adding minimal support for IPv6.

You and Bruce brought up a number of work items related to statd,
including having distinct statd behavior for remotes who are clients and
remotes who are servers.  Tom Talpey suggested we needed to send
multiple SM_NOTIFY requests to each host, and use TCP to do it when
possible, and you even specifically encouraged me to read his
connectathon presentation on this.  If Asian countries are driving the
IPv6 requirement, why wouldn't they want IDN support as well?
Interoperable NFS/IPv6 support requires TI-RPC.  Plus, NFS/IPv6
practically requires multi-homed NLM/NSM support -- see Alex's RFC draft
for details on that.
So a database is needed to accomplish all this?

No, a database is not specifically required.

However, libsqlite3 is a library that already provides all of the
elements we need: durable on-disk storage; proper data conversion for
binary blobs, single- and double-width character strings, and integers;
the ability to constrain record uniqueness; the ability to add new data
items easily to each record; and a facility for collating and searching
the host records.
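
As a rough illustration of those elements (this is a Python sketch, not
the schema from the patch; the table name, column names, and helper
functions are invented for the example), a host record with a text
hostname, a binary cookie, an integer prog/vers/proc tuple, and a
uniqueness constraint could be declared and used like this:

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the patch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS monitored_hosts (
    hostname TEXT    NOT NULL,   -- string data
    cookie   BLOB    NOT NULL,   -- opaque binary data
    prog     INTEGER NOT NULL,   -- RPC program number
    vers     INTEGER NOT NULL,   -- RPC version number
    proc     INTEGER NOT NULL,   -- RPC procedure number
    UNIQUE (hostname, prog, vers, proc)
);
"""

def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def monitor(db, hostname, cookie, prog, vers, proc):
    """Insert a record; return False if the same registration exists."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO monitored_hosts VALUES (?, ?, ?, ?, ?)",
                (hostname, cookie, prog, vers, proc),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def lookup(db, hostname):
    """Return all (cookie, prog, vers, proc) rows for one host."""
    return db.execute(
        "SELECT cookie, prog, vers, proc FROM monitored_hosts"
        " WHERE hostname = ?",
        (hostname,),
    ).fetchall()
```

The uniqueness check and the type handling live in the library here; the
caller only sees a success or failure result.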

sqlite3 is an embedded database, meaning the implementation is  
purposely smaller than a full SQL database, and is designed explicitly  
to have zero database administration requirements.  sqlite3 is  
designed for managing data for long-running network daemons, and it is  
widely used for that purpose.

If there is some other pre-existing code that can do this, I'm open to  
considering it.

Let me also point out that old statd is already broken in a number of
ways, and I certainly haven't heard a lot of complaints about it.  Our
client NLM has sent "0" as our NSM state number for years, for example.
Thus I hardly think there is a lot of risk in making changes here.  It
can only get better.

I can agree with you here...

IPv6 is used in Asia, where they almost certainly need to use non-ASCII
characters in their hostnames.  Internationalized domain names
are stored in double-wide character sets.  To provide reliable support
for IDNs in statd, we will have to guarantee somehow that we can store
an IDN as a file name (if we want to stay with the current scheme), no
matter what file system is used for /var.

So, what's stopping us? These are POSIX filesystems. They can store any
filename as long as it doesn't contain '/' or '\0'.

IDNs are UTF16.  /var therefore has to support UTF16 filenames; either
byte in a double-byte character can be '/' or '\0'.  That means the
underlying fs implementation has to support UTF16 (FAT32 anyone?), and
the system's locale has to be configured correctly.  If we decide not to
depend on the file system to support UTF16 filenames, then statd has to
be intelligent enough to figure out how to deal with converting UTF16
hostnames before storing them as filenames.  Then, we have to teach
matchhostname() and friends how to deal with double-byte character
strings...
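
The byte-level concern can be checked with a few lines of Python.  This
is only a sketch under the premise stated above, namely that the
hostname would be stored as raw UTF-16 bytes; the helper name is
hypothetical.

```python
def unsafe_filename_bytes(name, encoding):
    """Return the set of byte values in the encoded name that a POSIX
    file name may not contain: 0x00 (NUL) and 0x2F ('/')."""
    encoded = name.encode(encoding)
    return {b for b in encoded if b in (0x00, 0x2F)}

# An ASCII letter in little-endian UTF-16 carries a NUL high byte.
ascii_case = unsafe_filename_bytes("a", "utf-16-le")

# U+4E2F is a CJK ideograph; its low byte in UTF-16 is 0x2F, i.e. '/'.
cjk_case = unsafe_filename_bytes("\u4e2f", "utf-16-le")
```

Both cases produce a forbidden byte, so such a name could not be used
directly as a file name without some further encoding step.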
Has this been a problem in the past? How are other implementations
dealing with this? Have they gone to use a db as well?

No, IDNs are recent, but it is reasonable to think that support for
internationalized domain names is a feature that would appeal to the
same folks who are driving the IPv6 requirement.  This is not a hard
requirement, but it is one reason why statd's current on-disk format
is not adequate.

Yes, I understand that there are some statd implementations that use a  
database rather than flat files.  statd is nothing if not exactly a  
mechanism for storing structured data across system crashes.  That's  
exactly what databases are for.

Or we just tell sqlite3 that this is a double-byte character string, and
let it handle the collation and on-disk storage details for us.

The point is, this is yet another detail we have to either worry about
and open code in statd, or we can simply rely on what's already provided
in sqlite3.  No one, repeat NO ONE, is arguing that you can't implement
these features without sqlite3.  My argument is that we quickly bury a
whole bunch of details if we use sqlite3, and can then focus on larger
issues.  That's the prime goal of software layering with libraries.
What kind of performance hit will there be (if any)? The nice thing
about a file is you only have to read it once into a cache versus
doing a number of queries... or can one also cache queries?

sqlite3's performance for the statd application would actually be  
better than what we have today.

Naturally the database is cached in memory, making queries as fast as  
memory reads.  The better performance comes with record insertion and  
deletion.  Today statd does a file create and then an O_SYNC write to  
that file.  This requires synchronous metadata updates to the file  
system to create the new file and create a new directory entry for  
it.  If the directory becomes large, creating a new directory entry  
becomes even slower.  Likewise for record deletion, multiple  
synchronous metadata updates are required to remove the directory  
entry and the file containing the host record.

With sqlite3 (or any database style solution) record insertion and  
deletion can usually be handled with a single O_SYNC write to the  
database file.
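
To make the comparison concrete, here is a rough Python sketch of the
two approaches.  It is not statd code: the function names, the "hosts"
table, and the record layout are invented for the example.  How many
syncs the database actually issues per commit depends on its journal
mode and settings; the point of the sketch is only that adding or
removing a host touches one existing file rather than creating or
unlinking a directory entry each time.

```python
import os
import sqlite3

def add_host_flat(sm_dir, hostname, record):
    """Flat-file scheme: create one file per host, write it synchronously."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(os.path.join(sm_dir, hostname), flags, 0o600)
    try:
        os.write(fd, record)
    finally:
        os.close(fd)

def open_db(path):
    """Database scheme: all host records live in a single file."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA synchronous = FULL")  # sync to disk at commit
    db.execute(
        "CREATE TABLE IF NOT EXISTS hosts ("
        " hostname TEXT PRIMARY KEY,"
        " record   BLOB NOT NULL)"
    )
    return db

def add_hosts_db(db, hosts):
    """Insert any number of (hostname, record) pairs in one transaction."""
    with db:
        db.executemany("INSERT OR REPLACE INTO hosts VALUES (?, ?)", hosts)

def del_host_db(db, hostname):
    """Remove a host record; no directory entry is unlinked."""
    with db:
        db.execute("DELETE FROM hosts WHERE hostname = ?", (hostname,))
```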

You could argue that using sqlite3 means more CPU and memory  
consumption.  Perhaps, but that's a less onerous resource requirement  
than synchronous disk activity, in my view.

We can open code any or all of statd.  In fact the current statd open
codes RPC request creation in socket buffers rather than using glibc's
RPC API, and I think we agree that is not an optimal solution.  The
question is: should we duplicate code and bugs by open coding statd's
RPC and data storage?  Or should we pretend to be modern software
engineers, and use widely used and known good code that other people
have written already to handle these details?
I'm all for moving forward with "modern software" but, as
a common theme with me, I'm always worried about becoming
needlessly complicated or over engineering... which might be
the case with having statd use a db...

Consider what would happen if we open coded all of the details of
on-disk storage and record searching into statd itself.  I think
something like sqlite3 is a better and less complex solution than open
coding because all these details are moved out of statd into a
pre-existing library, thus making statd itself architecturally simpler,
and therefore easier to understand and maintain.

The one weakness here is the dependence on SQL.  That makes the statd  
code uglier and more complex than I would like, and is something I  
want to address.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
