Re: [PATCH 4/7] clstated: add routines for a sqlite backend database

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 14 Dec 2011 09:56:32 -0500
Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:

> 
> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
> 
> > Rather than roll our own "storage engine", use sqlite instead. It fits
> > the bill nicely as it does:
> > 
> > - durable on-disk storage
> > - the ability to constrain record uniqueness
> > - a facility for collating and searching the host records
> 
> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
> 

Indeed. That's one of the main reasons I chose sqlite here.

> > ...it does add a build dependency to nfs-utils, but almost all modern
> > distros provide those packages.
> > 
> > The current incarnation of this code dynamically links against a
> > provided sqlite library, but we could also consider including their
> > single-file "amalgamation" to reduce dependencies (though with all
> > the caveats that that entails).
> > 
> > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > ---
> > utils/clstated/Makefile.am |    4 +-
> > utils/clstated/clstated.c  |   16 ++-
> > utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
> > utils/clstated/sqlite.h    |   27 ++++
> > 4 files changed, 393 insertions(+), 5 deletions(-)
> > create mode 100644 utils/clstated/sqlite.c
> > create mode 100644 utils/clstated/sqlite.h
> > 
> > diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
> > index 9937353..d1cef3c 100644
> > --- a/utils/clstated/Makefile.am
> > +++ b/utils/clstated/Makefile.am
> > @@ -6,9 +6,9 @@
> > AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
> > sbin_PROGRAMS	= clstated
> > 
> > -clstated_SOURCES = clstated.c
> > +clstated_SOURCES = clstated.c sqlite.c
> > 
> > -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
> > +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
> > 
> > MAINTAINERCLEANFILES = Makefile.in
> > 
> > diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
> > index aab09a7..95ac696 100644
> > --- a/utils/clstated/clstated.c
> > +++ b/utils/clstated/clstated.c
> > @@ -36,6 +36,7 @@
> > 
> > #include "xlog.h"
> > #include "nfslib.h"
> > +#include "sqlite.h"
> > 
> > #ifndef PIPEFS_DIR
> > #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> > @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
> > static void
> > clstate_create(struct clstate_client *clnt)
> > {
> > +	int ret;
> > 	ssize_t bsize, wsize;
> > 	struct clstate_msg *cmsg = &clnt->cl_msg;
> > 
> > -	xlog(D_GENERAL, "%s: create client record", __func__);
> > +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
> > +			cmsg->cm_addr);
> > 
> > -	/* FIXME: create client record on storage here */
> > +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
> > +					cmsg->cm_len);
> > 
> > 	/* set up reply */
> > -	cmsg->cm_status = 0;
> > +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
> > 	cmsg->cm_len = 0;
> > 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
> > 		sizeof(cmsg->cm_u.cm_id));
> > 
> > 	bsize = sizeof(*cmsg);
> > 
> > +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
> > 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
> > 	if (wsize != bsize) {
> > 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
> > @@ -231,6 +236,11 @@ main(int argc, char **argv)
> > 	}
> > 
> > 	/* set up storage db */
> > +	rc = clstate_maindb_init();
> > +	if (rc) {
> > +		xlog(L_ERROR, "Failed to open main database: %d", rc);
> > +		goto out;
> > +	}
> > 
> > 	/* set up event handler */
> > 	fd = clstate_pipe_open(&clnt);
> > diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
> > new file mode 100644
> > index 0000000..ae83634
> > --- /dev/null
> > +++ b/utils/clstated/sqlite.c
> > @@ -0,0 +1,351 @@
> > +/*
> > + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@xxxxxxxxxx>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> > + * Boston, MA 02110-1301, USA.
> > + */
> > +
> > +/*
> > + * Explanation:
> > + *
> > + * This file contains the code to manage the sqlite backend database for the
> > + * clstated upcall daemon.
> > + *
> > + * The main database is called main.sqlite and contains the following tables:
> > + *
> > + * parameters: simple key/value pairs for storing database info
> > + *
> > + * addresses: list of server-side addresses recorded in the db, along with
> > + * 	      the filenames of their respective db files.
> > + *
> > + * The daemon attaches to each server-address database as needed. Each
> > + * server-address database has the following tables:
> 
> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
> 

As you pointed out above, sqlite allows us to share the db directory
across a cluster of nodes. Using different files here will allow us to
reduce lock contention in the cluster. If we put each server address DB
in a separate file, then only the node that currently owns that address
should need to touch it under normal circumstances.

> > + *
> > + * clients: one column containing a BLOB with the clstate as sent by the client
> > + * 	    and a timestamp (in epoch seconds) of when the record was
> > + * 	    established
> 
> I'm not yet clear on what's in cm_addr, but it seems to me that if you store each client's data as structured fields instead of a single BLOB, it will be much easier to use a generic tool like the sqlite3 command to debug the database.  That gives the observability and transparency of using a flat file with all the advantages of a database.
> 
> Plus, you get automatic data conversion; a one-to-one mapping between data on-disk and data in any endian-ness or word length.
> 

cm_addr is the *server's* address. One key point here is that (at least
in v4.0), clients do not SETCLIENTID against a server, but rather
against the server's address. If the client needs to talk to another
address it must call SETCLIENTID again. So, it's reasonable to track
the client names on a per-server address basis to reduce DB contention
in a clustered configuration.

> > + *
> > + * FIXME: should we also record the fsid being accessed?
> 
> I'm not exactly sure what additional data might be needed to support NFSv4 migration, for example.  However, using a database means it's relatively painless to add new columns in the future without having to provision them now.
> 

Right. It's just a matter of adding the correct column and somehow
populating them for existing entries. Making this extensible is a
design goal here. It's a complex enough problem, that I'm fairly
certain I'll get something wrong on the first pass...

> > + */
> > +
> > +#ifdef HAVE_CONFIG_H
> > +#include "config.h"
> > +#endif /* HAVE_CONFIG_H */
> > +
> > +#include <errno.h>
> > +#include <event.h>
> > +#include <stdbool.h>
> > +#include <string.h>
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <fcntl.h>
> > +#include <unistd.h>
> > +#include <sqlite3.h>
> > +#include <linux/limits.h>
> > +#include <linux/nfsd/clstate.h>
> > +
> > +#include "xlog.h"
> > +
> > +#define CLSTATE_SQLITE_SCHEMA_VERSION 1
> > +
> > +#ifndef CLSTATE_DIR
> > +#define CLSTATE_DIR NFS_STATEDIR "/clstate"
> > +#endif
> > +
> > +/* in milliseconds */
> > +#define CLSTATE_SQLITE_BUSY_TIMEOUT 10000
> > +
> > +/* private data structures */
> > +
> > +/* global variables */
> > +
> > +/* top level DB directory */
> > +static char *clstate_topdir = CLSTATE_DIR;
> > +
> > +/* reusable pathname and sql command buffer */
> > +static char buf[PATH_MAX];
> > +
> > +/* global database handle */
> > +static sqlite3 *dbh;
> > +
> > +/* forward declarations */
> > +
> > +/* '.' is not allowed in dbnames -- convert them to '-' */
> > +static void
> > +addr_to_dbname(const char *src, char *dst)
> > +{
> > +	while(*src) {
> > +		if (*src == '.')
> > +			*dst = '-';
> > +		else
> > +			*dst = *src;
> > +		++src;
> > +		++dst;
> > +	}
> > +	*dst = '\0';
> > +}
> > +
> > +void
> > +clstate_set_topdir(char *topdir)
> > +{
> > +	clstate_topdir = topdir;
> > +}
> > +
> > +/* make a directory, ignoring EEXIST errors unless it's not a directory */
> > +static int
> > +mkdir_if_not_exist(char *dirname)
> > +{
> > +	int ret;
> > +	struct stat statbuf;
> > +
> > +	ret = mkdir(dirname, S_IRWXU);
> > +	if (ret && errno != EEXIST)
> > +		return -errno;
> > +
> > +	ret = stat(dirname, &statbuf);
> > +	if (ret)
> > +		return -errno;
> > +
> > +	if (!S_ISDIR(statbuf.st_mode))
> > +		ret = -ENOTDIR;
> > +
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Open the "main" database, and attempt to initialize it by creating the
> > + * parameters table and inserting the schema version into it. Ignore any errors
> > + * from that, and then attempt to select the version out of it again. If the
> > + * version appears wrong, then assume that the DB is corrupt or has been
> > + * upgraded, and return an error.
> > + */
> > +int
> > +clstate_maindb_init(void)
> > +{
> > +	int ret;
> > +	char *err = NULL;
> > +	sqlite3_stmt *stmt = NULL;
> > +
> > +	ret = mkdir_if_not_exist(clstate_topdir);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", clstate_topdir);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	buf[PATH_MAX - 1] = '\0';
> > +
> > +	ret = sqlite3_open(buf, &dbh);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to open main database: %d", ret);
> > +		return ret;
> > +	}
> > +
> > +	ret = sqlite3_busy_timeout(dbh, CLSTATE_SQLITE_BUSY_TIMEOUT);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to set sqlite busy timeout: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	/* Try to create table */
> > +	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS parameters "
> > +				"(key TEXT PRIMARY KEY, value TEXT);",
> > +				NULL, NULL, &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to create parameter table: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	/* insert version into table -- ignore error if it fails */
> > +	ret = snprintf(buf, sizeof(buf),
> > +		       "INSERT OR IGNORE INTO parameters values (\"version\", "
> > +		       "\"%d\");", CLSTATE_SQLITE_SCHEMA_VERSION);
> > +	if (ret < 0) {
> > +		goto out_err;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to insert into parameter table: %d",
> > +				ret);
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_prepare_v2(dbh,
> > +		"SELECT value FROM parameters WHERE key == \"version\";",
> > +		 -1, &stmt, NULL);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to prepare select statement: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	/* check schema version */
> > +	ret = sqlite3_step(stmt);
> > +	if (ret != SQLITE_ROW) {
> > +		xlog(D_GENERAL, "Select statement execution failed: %s",
> > +				sqlite3_errmsg(dbh));
> > +		goto out_err;
> > +	}
> > +
> > +	/* process SELECT result */
> > +	ret = sqlite3_column_int(stmt, 0);
> > +	if (ret != CLSTATE_SQLITE_SCHEMA_VERSION) {
> > +		xlog(L_ERROR, "Unsupported database schema version! "
> > +			"Expected %d, got %d.",
> > +			CLSTATE_SQLITE_SCHEMA_VERSION, ret);
> > +		ret = -EINVAL;
> > +		goto out_err;
> > +	}
> > +
> > +	sqlite3_free(err);
> > +	sqlite3_finalize(stmt);
> > +	return 0;
> > +
> > +out_err:
> > +	if (err) {
> > +		xlog(L_ERROR, "sqlite error: %s", err);
> > +		sqlite3_free(err);
> > +	}
> > +	sqlite3_finalize(stmt);
> > +	sqlite3_close(dbh);
> > +	return ret;
> > +}
> > +
> > +static int
> > +clstate_db_attach(char *dbname)
> > +{
> > +	int ret;
> > +	char *err;
> > +
> > +	/* convert address string into valid filename */
> > +	ret = snprintf(buf, sizeof(buf),
> > +			"ATTACH DATABASE \"%s/%s.sqlite\" as \"%s\";",
> > +			clstate_topdir, dbname, dbname);
> > +	if (ret < 0) {
> > +		return ret;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		return ret;
> > +	}
> > +
> > +	/* attach to new DB in filename */
> > +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK)
> > +		xlog(L_ERROR, "Unable to attach database %s: %s", dbname, err);
> > +
> > +	xlog(D_GENERAL, "Attached database %s", dbname);
> > +	return ret;
> > +}
> > +
> > +static int
> > +clstate_db_detach(char *dbname)
> > +{
> > +	int ret;
> > +	char *err;
> > +
> > +	/* convert address string into valid filename */
> > +	ret = snprintf(buf, PATH_MAX - 1, "DETACH DATABASE \"%s\";", dbname);
> > +	if (ret < 0) {
> > +		return ret;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		return ret;
> > +	}
> > +
> > +	/* attach to new DB in filename */
> > +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK)
> > +		xlog(L_ERROR, "Unable to detach database %s: %s", dbname, err);
> > +
> > +	xlog(D_GENERAL, "Detached database %s", dbname);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Create a client record
> > + */
> > +int
> > +clstate_insert_client(const unsigned char *addr, const unsigned char *clname,
> > +			const size_t namelen)
> > +{
> > +	int ret;
> > +	char *err = NULL;
> > +	char dbname[CLSTATE_MAX_ADDRESS_LEN];
> > +	sqlite3_stmt *stmt = NULL;
> > +
> > +	addr_to_dbname((const char *)addr, dbname);
> > +
> > +	ret = clstate_db_attach(dbname);
> > +	if (ret != SQLITE_OK)
> > +		return ret;
> > +
> > +	snprintf(buf, sizeof(buf), "CREATE TABLE IF NOT EXISTS '%s'.clients "
> > +				   "(id BLOB PRIMARY KEY, time INTEGER);",
> > +				   dbname);
> > +	if (ret < 0) {
> > +		goto out_detach;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		goto out_detach;
> > +	}
> > +
> > +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(L_ERROR, "Unable to create table: %s", err);
> > +		goto out_detach;
> > +	}
> > +
> > +	ret = snprintf(buf, sizeof(buf),
> > +			"INSERT OR REPLACE INTO '%s'.clients VALUES (?, "
> > +			"strftime('%%s', 'now'));", dbname);
> > +	if (ret < 0) {
> > +		goto out_detach;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		goto out_detach;
> > +	}
> > +
> > +	ret = sqlite3_prepare_v2(dbh, buf, ret + 1, &stmt, NULL);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Statement prepare failed: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
> > +				SQLITE_STATIC);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Bind blob failed: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_step(stmt);
> > +	if (ret == SQLITE_DONE)
> > +		ret = SQLITE_OK;
> > +	else
> > +		xlog(D_GENERAL, "Unexpected return code from insert: %s",
> > +				sqlite3_errmsg(dbh));
> > +
> > +out_err:
> > +	xlog(D_GENERAL, "%s returning %d", __func__, ret);
> > +	sqlite3_finalize(stmt);
> > +	sqlite3_free(err);
> > +out_detach:
> > +	clstate_db_detach(dbname);
> > +	return ret;
> > +}
> > diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
> > new file mode 100644
> > index 0000000..4f81745
> > --- /dev/null
> > +++ b/utils/clstated/sqlite.h
> > @@ -0,0 +1,27 @@
> > +/*
> > + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@xxxxxxxxxx>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> > + * Boston, MA 02110-1301, USA.
> > + */
> > +
> > +#ifndef _SQLITE_H_
> > +#define _SQLITE_H_
> > +
> > +void clstate_set_topdir(char *topdir);
> > +int clstate_maindb_init(void);
> > +int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
> > +
> > +#endif /* _SQLITE_H */
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Jeff Layton <jlayton@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux