Re: [Nfs-ganesha-devel] [nfs-ganesha RFC PATCH v2 10/13] support: add a rados_grace support library

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 17 May 2018 08:38:37 -0400

On Wed, 2018-05-16 at 17:33 -0400, J. Bruce Fields wrote:
> I can't realistically review most of this code, so I went looking for
> some documentation and found this.  Maybe it's not the best starting
> point.  Forgive me if I seem dense, I'd just really like to see
> everything spelled out very precisely, and neither this nor your
> original presentation quite does that for me yet:
> 

Thanks for looking!

Yes, the comments here are a mess. I'll clean them up before the next
posting. Maybe I'll just transfer this to an RST doc and refer to it in
the comments.

> On Thu, May 03, 2018 at 02:58:00PM -0400, Jeff Layton wrote:
> > + * The rados_grace database is a rados object with a well-known name that
> > + * with which all cluster nodes can interact to coordinate grace-period
> > + * enforcement.
> > + *
> > + * It consists of two parts:
> > + *
> > + * 1) 2 uint64_t epoch values (stored LE) that indicate the serial number of
> > + * the current grace period (C) and the serial number of the grace period that
> 
> Delete "that".
> 
> > + * from which recovery is currently allowed (R). These are stored as object
> > + * data.
> > + *
> > + * 2) An omap containing a key value pair for each cluster node. The key is
> > + * the hostname of the node running ganesha, and the value is a byte with a
> > + * set of flags.
> > + *
> > + * Consider a single server epoch (E) of an individual NFS server to be the
> > + * period between reboots. That consists of an initial grace period and
> > + * a regular operation period. An epoch value of 0 is never valid.
> 
> Does "epoch value" mean the same thing as "serial number" above?  I
> assume it's something that uniquely identifies an "epoch".
> 

Yes.
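
To make the layout concrete, here is a minimal sketch of decoding the
object data, assuming it is exactly two little-endian uint64_t values
(C first, then R) as the quoted comment describes. The struct and
function names are made up for illustration and are not taken from the
patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the 16 bytes of object data. */
struct grace_epochs {
	uint64_t cur;	/* C: serial number of the current grace period */
	uint64_t rec;	/* R: epoch from which recovery is allowed, 0 if none */
};

/* Read one little-endian uint64_t regardless of host byte order. */
static uint64_t le64_decode(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode the object data into *out. Returns 0 on success, or -1 if the
 * buffer has the wrong size or C is 0 (an epoch value of 0 is never
 * valid). R may legitimately be 0.
 */
static int grace_epochs_decode(const unsigned char *buf, size_t len,
			       struct grace_epochs *out)
{
	if (len != 2 * sizeof(uint64_t))
		return -1;
	out->cur = le64_decode(buf);
	out->rec = le64_decode(buf + sizeof(uint64_t));
	return out->cur == 0 ? -1 : 0;
}
```

Decoding byte by byte keeps the sketch independent of the host's
endianness, which matters if cluster nodes are not all little-endian.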

> Also you've defined an "epoch" for a single server, it needs definition
> for a cluster too, right?

I'll do that. Basically the epoch is a cluster-wide property. It's just
that with a single server, you have a trivial cluster of one host.

> > + *
> > + * The first value (C) indicates the current server epoch. The client recovery
> > + * db should be tagged with this value on creation, or when updating the db
> > + * after the grace period has been fully lifted.
> 
> What's the "client recovery db"?  I guess it's the per-node database of
> long-form client identifiers identifying clients that are allowed to
> reclaim state?
> 

Yes, exactly.
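
As a rough illustration of how the epoch tag on a recovery db might be
used, assuming a node may only reclaim from the db whose tag matches R
(the second value in the object data), and that R == 0 means no
recovery at all. The function name is hypothetical and this is one
reading of the comment, not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * db_epoch: epoch value the recovery db was tagged with.
 * rec:      R from the grace database (0 means no recovery allowed).
 */
static bool may_reclaim_from(uint64_t db_epoch, uint64_t rec)
{
	return rec != 0 && db_epoch == rec;
}
```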

> > + *
> > + * The second uint64_t value
> 
> (R)
> 
> > in the data tells the NFS server from what
> > + * recovery db it is allowed to reclaim. A value of 0 in this field means that
> > + * we are out of the cluster-wide grace period and that no recovery is allowed.
> > + *
> > + * The omap contains a key for each host in the cluster. Typically, nodes join
> > + * the cluster by setting their omap key. The value of the omap is a single
> > + * byte that contains a set of flags that indicates their current need for a
> > + * grace period and whether they are locally enforcing one.
> 
> Is it really just those two flags?  A list of flags here would be
> helpful.
> 

Yes, just those two flags at this time. I plan to list them in a more
tabular way.
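
In the meantime, a minimal sketch of how the flag byte could be laid
out and queried. The macro names and bit values below are assumptions
for illustration only; the comment says just that one flag records the
need for a grace period and the other records local enforcement.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names and bit values, not taken from the patch. */
#define NODE_FLAG_NEED_GRACE	0x1	/* node needs a grace period */
#define NODE_FLAG_ENFORCING	0x2	/* node is locally enforcing one */

/* True if at least one node's flag byte has NEED_GRACE set. */
static bool cluster_needs_grace(const uint8_t *flags, size_t nodes)
{
	size_t i;

	for (i = 0; i < nodes; i++)
		if (flags[i] & NODE_FLAG_NEED_GRACE)
			return true;
	return false;
}

/* True if every node's flag byte has ENFORCING set. */
static bool cluster_all_enforcing(const uint8_t *flags, size_t nodes)
{
	size_t i;

	for (i = 0; i < nodes; i++)
		if (!(flags[i] & NODE_FLAG_ENFORCING))
			return false;
	return true;
}
```

Here flags[] stands in for the omap values, one byte per host key. Note
that cluster_all_enforcing() returns true for an empty cluster, so a
real caller would probably want to treat that case separately.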

> > + *
> > + * The grace period handling engine will update and store the flags, and it
> > + * can be queried to determine whether other nodes may need a grace period or
> > + * are enforcing.
> > + */

Thanks for the review!
-- 
Jeff Layton <jlayton@xxxxxxxxxx>