On Wed, 2019-01-09 at 23:37 +0000, Ricardo Dias wrote:
> On 08/01/19 14:13, Jeff Layton wrote:
> > On Fri, 2019-01-04 at 12:35 +0000, Ricardo Dias wrote:
> > > On 04/01/19 12:05, Jeff Layton wrote:
> > > > On Fri, 2019-01-04 at 10:35 +0000, Ricardo Dias wrote:
> > > > > On 03/01/19 19:05, Jeff Layton wrote:
> > > > > > Ricardo,
> > > > > >
> > > > > > We chatted earlier about a new ceph mgr module that would spin up EXPORT blocks for ganesha and stuff them into a RADOS object. Here's basically what we're aiming for. I think it's pretty similar to what SuSE's solution is doing so I think it'd be good to collaborate here.

> > > > > Just to make things clearer, we (SUSE) didn't implement a specific downstream implementation. The implementation we developed targets the upstream ceph-dashboard code.
> > > > >
> > > > > The dashboard backend code to manage ganesha exports is almost done. We still haven't opened a PR because we are still finishing the frontend code, which might make the backend change a bit.
> > > > >
> > > > > The current code is located here:
> > > > > https://github.com/rjfd/ceph/tree/wip-dashboard-nfs

> > > > > > Probably I should write this up in a doc somewhere, but here's what I'd envision. First an overview:
> > > > > >
> > > > > > The Rook.io ceph+ganesha CRD basically spins up nfs-ganesha pods under k8s that don't export anything by default and have a fairly stock config. Each ganesha daemon that is started has a boilerplate config file that ends with a %url include like this:
> > > > > >
> > > > > > %url rados://<pool>/<namespace>/conf-<nodeid>
> > > > > >
> > > > > > The nodeid in this case is the unique nodeid within a cluster of ganesha servers using the rados_cluster recovery backend in ganesha. Rook enumerates these starting with 'a' and going through 'z' (and then 'aa', 'ab', etc.). So node 'a' would have a config object called "conf-a".

> > > > > This was the same assumption we made, and the current implementation code can manage the exports of different servers (configuration objects).

> > > > > > What we currently lack is the code to set up those conf-<nodeid> objects. I know you have some code to do this sort of configuration via the dashboard and a REST API. Would it make more sense to split this bit out into a separate module, which would also allow it to be usable from the command line?

> > > > > Yes, and no :) I think the benefit of splitting the code into a separate module is the possibility of other mgr modules managing ganesha exports using the mgr "_remote" call infrastructure, or of someone wanting to manage ganesha exports without enabling the dashboard module.
> > > > >
> > > > > Regarding CLI commands, since the dashboard code exposes the export management through a REST API, we can always use curl to call it (although it will be a more verbose command).
> > > > >
> > > > > In the dashboard source directory we have a small bash script to help with calling the REST API from the CLI.
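For reference, a raw curl call against the same endpoint would look roughly like the sketch below. The https://localhost:8443 address and the $TOKEN variable holding a dashboard auth token are assumptions for a local test cluster, not details taken from the dashboard code; obtaining the token is left out here:

    $ curl -k \
          -H "Authorization: Bearer $TOKEN" \
          -H "Accept: application/json" \
          https://localhost:8443/api/nfs-ganesha/export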
> > > > > Here's an example of an export creation using the current implementation:
> > > > >
> > > > > $ ./run-backend-api-request.sh POST /api/nfs-ganesha/export \
> > > > >   '{
> > > > >     "hostname": "node1.domain",
> > > > >     "path": "/foo",
> > > > >     "fsal": {"name": "CEPH", "user_id": "admin", "fs_name": "myfs"},
> > > > >     "pseudo": "/foo",
> > > > >     "tag": null,
> > > > >     "access_type": "RW",
> > > > >     "squash": "no_root_squash",
> > > > >     "protocols": [4],
> > > > >     "transports": ["TCP"],
> > > > >     "clients": [{
> > > > >       "addresses": ["10.0.0.0/8"],
> > > > >       "access_type": "RO",
> > > > >       "squash": "root"
> > > > >     }]}'
> > > > >
> > > > > The JSON fields and structure are similar to the ganesha export configuration structure.
> > > > >
> > > > > We also have other commands:
> > > > >
> > > > > # list all exports
> > > > > $ ./run-backend-api-request.sh GET /api/nfs-ganesha/export
> > > > >
> > > > > # get an export
> > > > > $ ./run-backend-api-request.sh GET \
> > > > >     /api/nfs-ganesha/export/<hostname>/<id>
> > > > >
> > > > > # update an export
> > > > > $ ./run-backend-api-request.sh PUT \
> > > > >     /api/nfs-ganesha/export/<hostname>/<id> <json string>
> > > > >
> > > > > # delete an export
> > > > > $ ./run-backend-api-request.sh DELETE \
> > > > >     /api/nfs-ganesha/export/<hostname>/<id>
> > > > >
> > > > > In the dashboard implementation, the server configuration is identified by the <hostname> field, which does not need to be a real hostname. The dashboard keeps a map between the hostname and the rados object URL that stores the configuration of the server.

> > > > Ok, that all sounds fine, actually. I think we can probably live with the REST API for this.
> > > >
> > > > It might be good to rename the "hostname" field to something more generic (maybe nodeid). The rados_cluster recovery backend for ganesha requires a unique nodeid for each node. If it's not specified then it will use the hostname.

> > > Sounds good to me.

> > > > > The bootstrap of this host/rados_url map can be done in two ways:
> > > > > a) automatically: when an orchestrator backend is available, the dashboard asks the orchestrator for this information.
> > > > > b) manually: the dashboard provides some CLI commands to add this information. Example:
> > > > >    $ ceph dashboard ganesha-host-add <hostname> <rados_url>

> > > > I'll have to think about this bit.
> > > >
> > > > The main use case we're interested in currently is OpenStack Manila:
> > > >
> > > > https://wiki.openstack.org/wiki/Manila
> > > >
> > > > It has its own REST API, and admins can request new servers to be started or volumes to be created and exported.
> > > >
> > > > What I had envisioned was that requests to manila would get translated into requests to do things like:
> > > >
> > > > create volumes and subvolumes
> > > > ask the orchestrator to spin up a new daemon
> > > > modify the conf-* objects and ask the orchestrator to send daemons a SIGHUP
> > > >
> > > > I think what you're proposing should be fine there. Probably I just need to pull down your wip branch and play with it to better understand.

> > > I think all should work in the above use case if the dashboard is using the orchestrator.
> > > After the orchestrator spins up the new daemon, the dashboard will have access to the new daemon configuration without manual intervention.

> > > > > > My thinking was that we'd probably want to create a new mgr module for that, and could wire it up to the command line with something like:
> > > > > >
> > > > > > $ ceph nfsexport create --id=100 \
> > > > > >     --pool=mypool \
> > > > > >     --namespace=mynamespace \
> > > > > >     --type=cephfs \
> > > > > >     --volume=myfs \
> > > > > >     --subvolume=/foo \
> > > > > >     --pseudo=/foo \
> > > > > >     --cephx_userid=admin \
> > > > > >     --cephx_key=<base64 key> \
> > > > > >     --client=10.0.0.0/8,ro,root \
> > > > > >     --client=admhost,rw,none
> > > > > >
> > > > > > ...the "client" is a string that would be a tuple of client access string, r/o or r/w, and the userid squashing mode, and could be specified multiple times.

> > > > > The above command is similar to what we provide in the REST API, with the difference that the dashboard generates the export ID.
> > > > >
> > > > > Do you think it is important for the user to explicitly specify the export ID?

> > > > No, it'd be fine to autogenerate those in some fashion.

> > > > > > We'd also want to add a way to remove and enumerate exports. Maybe:
> > > > > >
> > > > > > $ ceph nfsexport ls
> > > > > > $ ceph nfsexport rm --id=100
> > > > > >
> > > > > > So the create command above would create an object called "export-100" in the given rados_pool/rados_namespace.
> > > > > >
> > > > > > From there, we'd need to also be able to "link" and "unlink" these export objects into the config files for each daemon. So if I have a cluster of 2 servers with nodeids "a" and "b":
> > > > > >
> > > > > > $ ceph nfsexport link --pool=mypool \
> > > > > >     --namespace=mynamespace \
> > > > > >     --id=100 \
> > > > > >     --node=a \
> > > > > >     --node=b
> > > > > >
> > > > > > ...with a corresponding "unlink" command. That would append this line to the objects called "conf-a" and "conf-b":
> > > > > >
> > > > > > %url rados://mypool/mynamespace/export-100
> > > > > >
> > > > > > ...and then call into the orchestrator to send a SIGHUP to the daemons to make them pick up the new configs. We might also want to sanity check whether any conf-* files are still linked to the export-* files before removing those objects.
> > > > > >
> > > > > > Thoughts?

> > > > > I got a bit lost with this link/unlink part. In the current dashboard implementation, when we create an export the implementation will add the export configuration into the rados://<pool>/<namespace>/conf-<nodeid> object and call the orchestrator to update/restart the service.
> > > > >
> > > > > It looks to me that you are separating the export creation from the export deployment. First you create the export, and then you add it to the service configuration.
> > > > >
> > > > > We can also implement this two-step behavior in the dashboard implementation, and in the dashboard Web UI we can have a checkbox where the user can specify whether they want to apply the new export right away or not.
> > > > >
> > > > > In the dashboard, we will also implement a "copy" command to copy an export configuration to another ganesha server. That will help with creating similar exports in different servers.
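To make the proposed "link" step a little more concrete, here is a rough sketch of what it would boil down to at the RADOS level, reusing the pool, namespace, export ID and nodeids from the example above (a hypothetical mgr module implementing this would presumably do the equivalent internally and then ask the orchestrator to SIGHUP the daemons):

    # append the %url include for export-100 to each node's conf object
    $ echo '%url rados://mypool/mynamespace/export-100' > /tmp/link.conf
    $ rados -p mypool --namespace=mynamespace append conf-a /tmp/link.conf
    $ rados -p mypool --namespace=mynamespace append conf-b /tmp/link.conf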
> > > > > Another option would be, instead of having a single "<hostname>" field in the create export function, to have a list of <hostname>s.

> > > > The two-step process was not so much for immediacy as to eliminate the need to replicate all of the EXPORT blocks across a potentially large series of objects. If we (e.g.) needed to modify a CLIENT block to allow a new subnet to have access, we'd only need to change the one object and then SIGHUP all of the daemons.
> > > >
> > > > That said, if replicating those blocks across multiple objects is simpler then we'll adapt.

> > > After thinking more about this, I think the approach you suggest of using an object for each export and then linking it to the servers' configuration makes more sense, and avoids the need for a "Copy Export" operation.
> > >
> > > Since you will also consume the dashboard REST API besides the dashboard frontend, I'll open a PR with just the backend implementation, so that it can be merged quickly without waiting for the frontend to be ready.

> > Thanks! I spent some time playing around with your branch and I think it looks like it'll work just fine for us.
> >
> > Just a few notes for anyone else that wants to do this. As Ricardo mentioned separately, RADOS namespace support is not yet plumbed in, so I'm using a separate "nfs-ganesha" pool here to house the RADOS objects needed for ganesha's configs and recovery backend:
> >
> > $ MON=3 OSD=1 MGR=1 MDS=1 ../src/vstart.sh -n
> > $ ganesha-rados-grace -p nfs-ganesha `hostname`
> > $ rados create -p nfs-ganesha conf-`hostname`
> > $ ceph dashboard ganesha-host-add `hostname` rados://nfs-ganesha/conf-`hostname`
> > $ ./run-backend-api-request.sh POST /api/nfs-ganesha/export "`cat ~/export.json`"
> >
> > ...where export.json is something like:
> >
> > -----------------[snip]------------------
> > {
> >   "hostname": "server_hostname",
> >   "path": "/foo",
> >   "fsal": {"name": "CEPH", "user_id": "admin", "fs_name": "myfs"},
> >   "pseudo": "/foo",
> >   "tag": null,
> >   "access_type": "RW",
> >   "squash": "no_root_squash",
> >   "protocols": [4],
> >   "transports": ["TCP"],
> >   "clients": [{
> >     "addresses": ["10.0.0.0/8"],
> >     "access_type": "RO",
> >     "squash": "root"
> >   }]
> > }
> > -----------------[snip]-------------------
> >
> > This creates an object with a ganesha config EXPORT block which looks valid. We may need to tweak it a bit, but I think this should work just fine.

> Thanks for posting the above steps!

> > I know the web UI is still pretty raw, but here are some comments anyway:
> >
> > For safety reasons, the default Access Type should probably be "RO", and the default Squash mode should be "Root" or maybe even "All". You may also want to somehow ensure that the admin consciously decides to export to the world instead of making that the default when no client is specified.

> This is very valuable information. I have never administered an NFS ganesha server and therefore don't have experience with what the defaults should be. Thanks for the suggestions.

No problem. We definitely want this to be a "safe by default" design, as much as possible. Getting exports wrong is a great way to compromise security in some environments.

> > It'd be nice to be able to granularly select the NFSv4 minorversions.
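For context, ganesha's NFSv4 config block is where the served minor versions are normally restricted; a minimal sketch of a "v4.1 and later only" setup might look like the snippet below (the exact option spelling should be checked against the ganesha-core-config documentation rather than taken from this thread):

    NFSv4 {
        # only allow NFSv4.1 and NFSv4.2 clients
        Minor_Versions = 1, 2;
    }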
> > If you exclusively have NFSv4.1+ clients, then the grace period can be lifted early after a restart. That's a big deal for continuous operation. In our clustered configurations, we plan to not support anything before v4.1 by default.

> I didn't know about the existence of minorversions. Where can I get the list of all possible values for the protocol version?

Those are all governed by the IETF RFCs. Basically we have v4.0, v4.1 and v4.2 so far, and I wouldn't worry about anything beyond that at this point. We may eventually end up with a v4.3, but we're sort of moving to a model that is based on feature flags, so that may not ever materialize.

> > We'll probably need some way to specify the fsal.user_id field in the UI too. Maybe a dropdown box that enumerates the available principals?

> Yes, and that's already been done by Tiago Melo (tmelo on IRC) in his development branch. I believe he has added a dropdown with the list of cephx users.

Nice.

> > That's all for now. I think what I'll probably do is close out my PR to add NFS support to the orchestrator and concentrate on wiring the rook orchestrator into what you have, since it's more complete.

FWIW... after I took a closer look, I think the PR I had to add NFS support to the orchestrator is mostly orthogonal to your changes, so I think we'll probably want to merge the latest version of it after all.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>