On Dec 14, 2011, at 11:15 AM, Jeff Layton wrote:

> On Wed, 14 Dec 2011 10:47:56 -0500
> Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>>
>> On Dec 14, 2011, at 10:14 AM, Jeff Layton wrote:
>>
>>> On Wed, 14 Dec 2011 09:56:32 -0500
>>> Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>
>>>>
>>>> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
>>>>
>>>>> Rather than roll our own "storage engine", use sqlite instead. It fits
>>>>> the bill nicely as it does:
>>>>>
>>>>> - durable on-disk storage
>>>>> - the ability to constrain record uniqueness
>>>>> - a facility for collating and searching the host records
>>>>
>>>> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
>>>>
>>>
>>> Indeed. That's one of the main reasons I chose sqlite here.
>>>
>>>>> ...it does add a build dependency to nfs-utils, but almost all modern
>>>>> distros provide those packages.
>>>>>
>>>>> The current incarnation of this code dynamically links against a
>>>>> provided sqlite library, but we could also consider including their
>>>>> single-file "amalgamation" to reduce dependencies (though with all
>>>>> the caveats that that entails).
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>>> ---
>>>>>  utils/clstated/Makefile.am |    4 +-
>>>>>  utils/clstated/clstated.c  |   16 ++-
>>>>>  utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>  utils/clstated/sqlite.h    |   27 ++++
>>>>>  4 files changed, 393 insertions(+), 5 deletions(-)
>>>>>  create mode 100644 utils/clstated/sqlite.c
>>>>>  create mode 100644 utils/clstated/sqlite.h
>>>>>
>>>>> diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
>>>>> index 9937353..d1cef3c 100644
>>>>> --- a/utils/clstated/Makefile.am
>>>>> +++ b/utils/clstated/Makefile.am
>>>>> @@ -6,9 +6,9 @@
>>>>>  AM_CFLAGS += -D_LARGEFILE64_SOURCE
>>>>>  sbin_PROGRAMS = clstated
>>>>>
>>>>> -clstated_SOURCES = clstated.c
>>>>> +clstated_SOURCES = clstated.c sqlite.c
>>>>>
>>>>> -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
>>>>> +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
>>>>>
>>>>>  MAINTAINERCLEANFILES = Makefile.in
>>>>>
>>>>> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
>>>>> index aab09a7..95ac696 100644
>>>>> --- a/utils/clstated/clstated.c
>>>>> +++ b/utils/clstated/clstated.c
>>>>> @@ -36,6 +36,7 @@
>>>>>
>>>>>  #include "xlog.h"
>>>>>  #include "nfslib.h"
>>>>> +#include "sqlite.h"
>>>>>
>>>>>  #ifndef PIPEFS_DIR
>>>>>  #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
>>>>> @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
>>>>>  static void
>>>>>  clstate_create(struct clstate_client *clnt)
>>>>>  {
>>>>> +	int ret;
>>>>>  	ssize_t bsize, wsize;
>>>>>  	struct clstate_msg *cmsg = &clnt->cl_msg;
>>>>>
>>>>> -	xlog(D_GENERAL, "%s: create client record", __func__);
>>>>> +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
>>>>> +		cmsg->cm_addr);
>>>>>
>>>>> -	/* FIXME: create client record on storage here */
>>>>> +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
>>>>> +				    cmsg->cm_len);
>>>>>
>>>>>  	/* set up reply */
>>>>> -	cmsg->cm_status = 0;
>>>>> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
>>>>>  	cmsg->cm_len = 0;
>>>>>  	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
>>>>>  		sizeof(cmsg->cm_u.cm_id));
>>>>>
>>>>>  	bsize = sizeof(*cmsg);
>>>>>
>>>>> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
>>>>>  	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
>>>>>  	if (wsize != bsize) {
>>>>>  		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
>>>>> @@ -231,6 +236,11 @@ main(int argc, char **argv)
>>>>>  	}
>>>>>
>>>>>  	/* set up storage db */
>>>>> +	rc = clstate_maindb_init();
>>>>> +	if (rc) {
>>>>> +		xlog(L_ERROR, "Failed to open main database: %d", rc);
>>>>> +		goto out;
>>>>> +	}
>>>>>
>>>>>  	/* set up event handler */
>>>>>  	fd = clstate_pipe_open(&clnt);
>>>>> diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
>>>>> new file mode 100644
>>>>> index 0000000..ae83634
>>>>> --- /dev/null
>>>>> +++ b/utils/clstated/sqlite.c
>>>>> @@ -0,0 +1,351 @@
>>>>> +/*
>>>>> + * Copyright (C) 2011 Red Hat, Jeff Layton <jlayton@xxxxxxxxxx>
>>>>> + *
>>>>> + * This program is free software; you can redistribute it and/or
>>>>> + * modify it under the terms of the GNU General Public License
>>>>> + * as published by the Free Software Foundation; either version 2
>>>>> + * of the License, or (at your option) any later version.
>>>>> + *
>>>>> + * This program is distributed in the hope that it will be useful,
>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>> + * GNU General Public License for more details.
>>>>> + *
>>>>> + * You should have received a copy of the GNU General Public License
>>>>> + * along with this program; if not, write to the Free Software
>>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
>>>>> + * Boston, MA 02110-1301, USA.
>>>>> + */
>>>>> +
>>>>> +/*
>>>>> + * Explanation:
>>>>> + *
>>>>> + * This file contains the code to manage the sqlite backend database for the
>>>>> + * clstated upcall daemon.
>>>>> + *
>>>>> + * The main database is called main.sqlite and contains the following tables:
>>>>> + *
>>>>> + * parameters: simple key/value pairs for storing database info
>>>>> + *
>>>>> + * addresses: list of server-side addresses recorded in the db, along with
>>>>> + *            the filenames of their respective db files.
>>>>> + *
>>>>> + * The daemon attaches to each server-address database as needed. Each
>>>>> + * server-address database has the following tables:
>>>>
>>>> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
>>>>
>>>
>>> As you pointed out above, sqlite allows us to share the db directory
>>> across a cluster of nodes. Using different files here will allow us to
>>> reduce lock contention in the cluster. If we put each server address DB
>>> in a separate file, then only the node that currently owns that address
>>> should need to touch it under normal circumstances.
>>
>> This seems like premature optimization.  A single file is much simpler, and I'm guessing it would be a more standard arrangement of tables.  If cluster contention avoidance is a key design point, it's worth calling out in the documenting comment, because this multi-file arrangement is not a typical use of sqlite3.
>>
>> Why do you predict that the database will be under contention?  Isn't it the case that sqlite3 can allow multiple accessors to a single database file?
>>
>
> Since the grace periods will need to be coordinated, it's likely that
> on a server reboot clients will be reestablishing state with
> multiple nodes in the cluster at a time. That means at least some
> contention for the DB. Whether it's problematic or not is an open
> question at this point, but why not avoid that if we can?

Yes, there may be contention during reclaim.  However, I submit that we don't understand the details of that well enough to solidify a design for handling that workload scalably at this point.

I recommend some simple prototyping and careful modeling to understand exactly what's going to happen in such a scenario.  Then we can approach a client ID database design with good information about what needs to scale and what can be designed for convenience.  :-)

Just a thought.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
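[Editor's note: the contention both sides allude to is concrete: SQLite permits many concurrent readers of one database file but serializes writers on a database-wide write lock. A minimal sketch of that behavior, using Python's stdlib sqlite3 module (the daemon itself is C; table and value names here are illustrative only):]

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "shared.sqlite")

# timeout=0: fail immediately on lock contention instead of retrying.
# isolation_level=None: no implicit transactions; BEGIN/COMMIT are explicit.
w1 = sqlite3.connect(db, timeout=0, isolation_level=None)
w2 = sqlite3.connect(db, timeout=0, isolation_level=None)
w1.execute("CREATE TABLE clients (id TEXT PRIMARY KEY)")

w1.execute("BEGIN IMMEDIATE")    # writer 1 takes the db-wide write lock
w1.execute("INSERT INTO clients VALUES ('a')")

try:
    w2.execute("BEGIN IMMEDIATE")  # second writer is refused meanwhile
    contended = False
except sqlite3.OperationalError:   # "database is locked"
    contended = True

w1.execute("COMMIT")             # lock released; writer 2 may now proceed
w2.execute("INSERT INTO clients VALUES ('b')")

print(contended)  # → True
```

A real deployment would set a busy timeout (or retry loop) rather than fail outright, but the brief lock-out is the cost Jeff's per-address files aim to confine to a single node, and the behavior Chuck suggests measuring before committing to a layout.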