On Dec 14, 2011, at 11:15 AM, Jeff Layton wrote:

> On Wed, 14 Dec 2011 10:47:56 -0500
> Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>>
>> On Dec 14, 2011, at 10:14 AM, Jeff Layton wrote:
>>
>>> On Wed, 14 Dec 2011 09:56:32 -0500
>>> Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>
>>>>
>>>> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
>>>>
>>>>> Rather than roll our own "storage engine", use sqlite instead. It fits
>>>>> the bill nicely as it does:
>>>>>
>>>>> - durable on-disk storage
>>>>> - the ability to constrain record uniqueness
>>>>> - a facility for collating and searching the host records
>>>>
>>>> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
>>>>
>>>
>>> Indeed. That's one of the main reasons I chose sqlite here.
>>>
>>>>> ...it does add a build dependency to nfs-utils, but almost all modern
>>>>> distros provide those packages.
>>>>>
>>>>> The current incarnation of this code dynamically links against a
>>>>> provided sqlite library, but we could also consider including their
>>>>> single-file "amalgamation" to reduce dependencies (though with all
>>>>> the caveats that that entails).
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>>> ---
>>>>>  utils/clstated/Makefile.am |    4 +-
>>>>>  utils/clstated/clstated.c  |   16 ++-
>>>>>  utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>  utils/clstated/sqlite.h    |   27 ++++
>>>>>  4 files changed, 393 insertions(+), 5 deletions(-)
>>>>>  create mode 100644 utils/clstated/sqlite.c
>>>>>  create mode 100644 utils/clstated/sqlite.h
>>>>>
>>>>> diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
>>>>> index 9937353..d1cef3c 100644
>>>>> --- a/utils/clstated/Makefile.am
>>>>> +++ b/utils/clstated/Makefile.am
>>>>> @@ -6,9 +6,9 @@
>>>>>  AM_CFLAGS += -D_LARGEFILE64_SOURCE
>>>>>  sbin_PROGRAMS = clstated
>>>>>
>>>>> -clstated_SOURCES = clstated.c
>>>>> +clstated_SOURCES = clstated.c sqlite.c
>>>>>
>>>>> -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
>>>>> +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
>>>>>
>>>>>  MAINTAINERCLEANFILES = Makefile.in
>>>>>
>>>>> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
>>>>> index aab09a7..95ac696 100644
>>>>> --- a/utils/clstated/clstated.c
>>>>> +++ b/utils/clstated/clstated.c
>>>>> @@ -36,6 +36,7 @@
>>>>>
>>>>>  #include "xlog.h"
>>>>>  #include "nfslib.h"
>>>>> +#include "sqlite.h"
>>>>>
>>>>>  #ifndef PIPEFS_DIR
>>>>>  #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
>>>>> @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
>>>>>  static void
>>>>>  clstate_create(struct clstate_client *clnt)
>>>>>  {
>>>>> +	int ret;
>>>>>  	ssize_t bsize, wsize;
>>>>>  	struct clstate_msg *cmsg = &clnt->cl_msg;
>>>>>
>>>>> -	xlog(D_GENERAL, "%s: create client record", __func__);
>>>>> +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
>>>>> +		cmsg->cm_addr);
>>>>>
>>>>> -	/* FIXME: create client record on storage here */
>>>>> +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
>>>>> +				    cmsg->cm_len);
>>>>>
>>>>>  	/* set up reply */
>>>>> -	cmsg->cm_status = 0;
>>>>> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
>>>>>  	cmsg->cm_len = 0;
>>>>>  	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
>>>>>  		sizeof(cmsg->cm_u.cm_id));
>>>>>
>>>>>  	bsize = sizeof(*cmsg);
>>>>>
>>>>> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
>>>>>  	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
>>>>>  	if (wsize != bsize) {
>>>>>  		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
>>>>> @@ -231,6 +236,11 @@ main(int argc, char **argv)
>>>>>  	}
>>>>>
>>>>>  	/* set up storage db */
>>>>> +	rc = clstate_maindb_init();
>>>>> +	if (rc) {
>>>>> +		xlog(L_ERROR, "Failed to open main database: %d", rc);
>>>>> +		goto out;
>>>>> +	}
>>>>>
>>>>>  	/* set up event handler */
>>>>>  	fd = clstate_pipe_open(&clnt);
>>>>> diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
>>>>> new file mode 100644
>>>>> index 0000000..ae83634
>>>>> --- /dev/null
>>>>> +++ b/utils/clstated/sqlite.c
>>>>> @@ -0,0 +1,351 @@
>>>>> +/*
>>>>> + * Copyright (C) 2011 Red Hat, Jeff Layton <jlayton@xxxxxxxxxx>
>>>>> + *
>>>>> + * This program is free software; you can redistribute it and/or
>>>>> + * modify it under the terms of the GNU General Public License
>>>>> + * as published by the Free Software Foundation; either version 2
>>>>> + * of the License, or (at your option) any later version.
>>>>> + *
>>>>> + * This program is distributed in the hope that it will be useful,
>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>> + * GNU General Public License for more details.
>>>>> + *
>>>>> + * You should have received a copy of the GNU General Public License
>>>>> + * along with this program; if not, write to the Free Software
>>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
>>>>> + * Boston, MA 02110-1301, USA.
>>>>> + */
>>>>> +
>>>>> +/*
>>>>> + * Explanation:
>>>>> + *
>>>>> + * This file contains the code to manage the sqlite backend database for the
>>>>> + * clstated upcall daemon.
>>>>> + *
>>>>> + * The main database is called main.sqlite and contains the following tables:
>>>>> + *
>>>>> + * parameters: simple key/value pairs for storing database info
>>>>> + *
>>>>> + * addresses: list of server-side addresses recorded in the db, along with
>>>>> + *            the filenames of their respective db files.
>>>>> + *
>>>>> + * The daemon attaches to each server-address database as needed. Each
>>>>> + * server-address database has the following tables:
>>>>
>>>> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
>>>>
>>>
>>> As you pointed out above, sqlite allows us to share the db directory
>>> across a cluster of nodes. Using different files here will allow us to
>>> reduce lock contention in the cluster. If we put each server address DB
>>> in a separate file, then only the node that currently owns that address
>>> should need to touch it under normal circumstances.
>>
>> This seems like premature optimization.  A single file is much simpler, and I'm guessing it would be a more standard arrangement of tables.  If cluster contention avoidance is a key design point, it's worth calling out in the documenting comment, because this multi-file arrangement is not a typical use of sqlite3.
>>
>> Why do you predict that the database will be under contention?  Isn't it the case that sqlite3 can allow multiple accessors to a single database file?
>>
>
> Since the grace periods will need to be coordinated, it's likely that
> on a server reboot clients will be reestablishing state with
> multiple nodes in the cluster at a time. That means at least some
> contention for the DB. Whether it's problematic or not is an open
> question at this point, but why not avoid that if we can?

Yes, there may be contention during reclaim.  However, I submit that we don't understand the details of that well enough to solidify a design for handling that workload scalably at this point.

I recommend some simple prototyping and careful modeling to understand exactly what's going to happen in such a scenario.  Then we can approach a client ID database design with good information about what needs to scale and what can be designed for convenience.  :-)

Just a thought.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
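[Editor's note: the contention both sides allude to is concrete: SQLite permits many concurrent readers of one database file but serializes writers on a database-wide write lock. A minimal sketch of that behavior, using Python's stdlib sqlite3 module (the daemon itself is C; table and value names here are illustrative only):]

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "shared.sqlite")

# timeout=0: fail immediately on lock contention instead of retrying.
# isolation_level=None: no implicit transactions; BEGIN/COMMIT are explicit.
w1 = sqlite3.connect(db, timeout=0, isolation_level=None)
w2 = sqlite3.connect(db, timeout=0, isolation_level=None)
w1.execute("CREATE TABLE clients (id TEXT PRIMARY KEY)")

w1.execute("BEGIN IMMEDIATE")    # writer 1 takes the db-wide write lock
w1.execute("INSERT INTO clients VALUES ('a')")

try:
    w2.execute("BEGIN IMMEDIATE")  # second writer is refused meanwhile
    contended = False
except sqlite3.OperationalError:   # "database is locked"
    contended = True

w1.execute("COMMIT")             # lock released; writer 2 may now proceed
w2.execute("INSERT INTO clients VALUES ('b')")

print(contended)  # → True
```

A real deployment would set a busy timeout (or retry loop) rather than fail outright, but the brief lock-out is the cost Jeff's per-address files aim to confine to a single node, and the behavior Chuck suggests measuring before committing to a layout.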