Re: how should we manage ganesha's export tables from ceph-mgr ?

On 10/01/19 15:30, Jeff Layton wrote:
> On Wed, 2019-01-09 at 23:37 +0000, Ricardo Dias wrote:
>>
>> On 08/01/19 14:13, Jeff Layton wrote:
>>> On Fri, 2019-01-04 at 12:35 +0000, Ricardo Dias wrote:
>>>> On 04/01/19 12:05, Jeff Layton wrote:
>>>>> On Fri, 2019-01-04 at 10:35 +0000, Ricardo Dias wrote:
>>>>>> On 03/01/19 19:05, Jeff Layton wrote:
>>>>>>> Ricardo,
>>>>>>>
>>>>>>> We chatted earlier about a new ceph mgr module that would spin up EXPORT
>>>>>>> blocks for ganesha and stuff them into a RADOS object. Here's basically
>>>>>>> what we're aiming for. I think it's pretty similar to what SuSE's
>>>>>>> solution is doing so I think it'd be good to collaborate here.
>>>>>>
>>>>>> Just to make things clearer, we (SUSE) didn't build a SUSE-specific
>>>>>> downstream implementation. The implementation we developed targets the
>>>>>> upstream ceph-dashboard code.
>>>>>>
>>>>>> The dashboard backend code to manage ganesha exports is almost done. We
>>>>>> still haven't opened a PR because we are still finishing the frontend
>>>>>> code, which might require the backend to change a bit.
>>>>>>
>>>>>> The current code is located here:
>>>>>> https://github.com/rjfd/ceph/tree/wip-dashboard-nfs
>>>>>>
>>>>>>> Probably I should write this up in a doc somewhere, but here's what I'd
>>>>>>> envision. First an overview:
>>>>>>>
>>>>>>> The Rook.io ceph+ganesha CRD basically spins up nfs-ganesha pods under
>>>>>>> k8s that don't export anything by default and have a fairly stock
>>>>>>> config. Each ganesha daemon that is started has a boilerplate config
>>>>>>> file that ends with a %url include like this:
>>>>>>>
>>>>>>>     %url rados://<pool>/<namespace>/conf-<nodeid>
>>>>>>>
>>>>>>> The nodeid in this case is the unique nodeid within a cluster of ganesha
>>>>>>> servers using the rados_cluster recovery backend in ganesha. Rook
>>>>>>> enumerates these starting with 'a' and going through 'z' (and then 'aa',
>>>>>>> 'ab', etc.). So node 'a' would have a config object called "conf-a".
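[Editor's note: the 'a'..'z', 'aa', 'ab' scheme described above is bijective base-26 naming. A minimal sketch of how such ids could be generated; the function name is illustrative, not Rook's actual code:]

```python
def nodeid(index: int) -> str:
    """Map a 0-based index to an id in the sequence a..z, aa, ab, ...

    This is bijective base-26: 0 -> 'a', 25 -> 'z', 26 -> 'aa'.
    """
    s = ""
    index += 1  # shift to 1-based for the bijective numeral system
    while index > 0:
        index, rem = divmod(index - 1, 26)
        s = chr(ord("a") + rem) + s
    return s

# conf-<nodeid> object name for the first node in the cluster
print("conf-" + nodeid(0))  # → conf-a
```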
>>>>>>>
>>>>>>
>>>>>> This was the same assumption we made, and the current implementation
>>>>>> code can manage the exports of different servers (configuration objects).
>>>>>>
>>>>>>> What we currently lack is the code to set up those conf-<nodeid>
>>>>>>> objects. I know you have some code to do this sort of configuration via
>>>>>>> the dashboard and a REST API. Would it make more sense to split this bit
>>>>>>> out into a separate module, which would also allow it to be usable from
>>>>>>> the command line?
>>>>>>
>>>>>> Yes, and no :) I think the benefit of splitting the code into a separate
>>>>>> module is that other mgr modules could then manage ganesha exports
>>>>>> through the mgr "_remote" call infrastructure, and that someone could
>>>>>> manage ganesha exports without enabling the dashboard module.
>>>>>>
>>>>>> Regarding CLI commands, since the dashboard code exposes the export
>>>>>> management through a REST API, we can always use curl to call it
>>>>>> (although it will be a more verbose command).
>>>>>>
>>>>>> In the dashboard source directory we have a small bash script to help
>>>>>> calling the REST API from the CLI. Here's an example of an export
>>>>>> creation using the current implementation:
>>>>>>
>>>>>> $ ./run-backend-api-request.sh POST /api/nfs-ganesha/export \
>>>>>>   '{
>>>>>>      "hostname": "node1.domain",
>>>>>>      "path": "/foo",
>>>>>>      "fsal": {"name": "CEPH", "user_id": "admin", "fs_name": "myfs"},
>>>>>>      "pseudo": "/foo",
>>>>>>      "tag": null,
>>>>>>      "access_type": "RW",
>>>>>>      "squash": "no_root_squash",
>>>>>>      "protocols": [4],
>>>>>>      "transports": ["TCP"],
>>>>>>      "clients": [{
>>>>>>        "addresses": ["10.0.0.0/8"],
>>>>>>        "access_type": "RO",
>>>>>>        "squash": "root"
>>>>>>      }]}'
>>>>>>
>>>>>> The JSON fields and structure are similar to the ganesha export
>>>>>> configuration structure.
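[Editor's note: as described later in the thread, a request like the one above results in a ganesha EXPORT block being written to the config object. An illustrative sketch of that block, following ganesha's export config format; attribute names and values here are an approximation, not the exact generated output:]

```
EXPORT {
    Export_Id = 1;
    Path = "/foo";
    Pseudo = "/foo";
    Access_Type = RW;
    Squash = No_Root_Squash;
    Protocols = 4;
    Transports = TCP;
    FSAL {
        Name = CEPH;
        User_Id = "admin";
        Filesystem = "myfs";
    }
    CLIENT {
        Clients = 10.0.0.0/8;
        Access_Type = RO;
        Squash = Root;
    }
}
```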
>>>>>>
>>>>>> We also have other commands:
>>>>>>
>>>>>> # list all exports
>>>>>> $ ./run-backend-api-request.sh GET /api/nfs-ganesha/export
>>>>>>
>>>>>> # get an export
>>>>>> $ ./run-backend-api-request.sh GET \
>>>>>> 	/api/nfs-ganesha/export/<hostname>/<id>
>>>>>>
>>>>>> # update an export
>>>>>> $ ./run-backend-api-request.sh PUT \
>>>>>> 	/api/nfs-ganesha/export/<hostname>/<id> <json string>
>>>>>>
>>>>>> # delete an export
>>>>>> $ ./run-backend-api-request.sh DELETE \
>>>>>> 	/api/nfs-ganesha/export/<hostname>/<id>
>>>>>>
>>>>>>
>>>>>> In the dashboard implementation, the server configuration is identified
>>>>>> by the <hostname> field, which does not need to be a real hostname.
>>>>>> The dashboard keeps a map between the hostname and the rados object URL
>>>>>> that stores the configuration of the server.
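[Editor's note: a sketch of how such a rados object URL might be split into its components; the helper name is hypothetical and the real dashboard code may do this differently:]

```python
def parse_rados_url(url: str):
    """Split rados://<pool>[/<namespace>]/<object> into (pool, namespace, object).

    An empty string is returned for the namespace when it is omitted.
    """
    prefix = "rados://"
    if not url.startswith(prefix):
        raise ValueError("not a rados URL: %s" % url)
    parts = url[len(prefix):].split("/")
    if len(parts) == 2:  # no namespace component
        return parts[0], "", parts[1]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError("malformed rados URL: %s" % url)

print(parse_rados_url("rados://mypool/mynamespace/conf-a"))
```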
>>>>>>
>>>>>
>>>>> Ok, that all sounds fine, actually. I think we can probably live with
>>>>> the REST API for this.
>>>>>
>>>>> It might be good to rename the "hostname" field to something more
>>>>> generic (maybe nodeid). The rados_cluster recovery backend for ganesha
>>>>> requires a unique nodeid for each node. If it's not specified then it
>>>>> will use the hostname.
>>>>
>>>> Sounds good to me.
>>>>
>>>>>> The bootstrap of this host/rados_url map can be done in two ways:
>>>>>> a) automatically: when an orchestrator backend is available, the
>>>>>> dashboard asks the orchestrator for this information.
>>>>>> b) manually: the dashboard provides some CLI commands to add this
>>>>>> information. Example:
>>>>>> $  ceph dashboard ganesha-host-add <hostname> <rados_url>
>>>>>>
>>>>>
>>>>> I'll have to think about this bit.
>>>>>
>>>>> The main use case we're interested in currently is Openstack Manila:
>>>>>
>>>>>     https://wiki.openstack.org/wiki/Manila
>>>>>
>>>>> It has its own REST API, and admins can request new servers to be
>>>>> started or volumes to be created and exported.
>>>>>
>>>>> What I had envisioned was that requests to manila would get translated
>>>>> into requests to do things like:
>>>>>
>>>>> create volumes and subvolumes
>>>>> ask the orchestrator to spin up a new daemon
>>>>> modify the conf-* objects and ask the orchestrator to send daemons a
>>>>> SIGHUP
>>>>>
>>>>> I think what you're proposing should be fine there. Probably I just need
>>>>> to pull down your wip branch and play with it to better understand.
>>>>
>>>> I think all of the above use case should work if the dashboard is using
>>>> the orchestrator. After the orchestrator spins up the new daemon, the
>>>> dashboard will have access to the new daemon's configuration without
>>>> manual intervention.
>>>>
>>>>>>> My thinking was that we'd probably want to create a new mgr module for
>>>>>>> that, and could wire it up to the command line with something like:
>>>>>>>
>>>>>>>     $ ceph nfsexport create --id=100			\
>>>>>>> 			--pool=mypool			\
>>>>>>> 			--namespace=mynamespace		\
>>>>>>> 			--type=cephfs			\
>>>>>>> 			--volume=myfs			\
>>>>>>> 			--subvolume=/foo		\
>>>>>>> 			--pseudo=/foo			\
>>>>>>> 			--cephx_userid=admin		\
>>>>>>> 			--cephx_key=<base64 key>	\
>>>>>>> 			--client=10.0.0.0/8,ro,root	\
>>>>>>> 			--client=admhost,rw,none
>>>>>>>
>>>>>>> ...the "client" is a string that would be a tuple of client access
>>>>>>> string, r/o or r/w, and the userid squashing mode, and could be
>>>>>>> specified multiple times.
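[Editor's note: the proposed --client tuple could be parsed into the same client structure used by the REST API along these lines; a sketch only, with names of my own choosing:]

```python
def parse_client(spec: str):
    """Parse '<addr>,<ro|rw>,<squash>' (e.g. '10.0.0.0/8,ro,root') into a
    client dict shaped like the REST API's "clients" entries."""
    addr, access, squash = spec.split(",")
    if access not in ("ro", "rw"):
        raise ValueError("access must be 'ro' or 'rw': %s" % access)
    return {
        "addresses": [addr],
        "access_type": access.upper(),
        "squash": squash,
    }

print(parse_client("10.0.0.0/8,ro,root"))
```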
>>>>>>
>>>>>> The above command is similar to what we provide in the REST API with the
>>>>>> difference that the dashboard generates the export ID.
>>>>>>
>>>>>> Do you think it is important for the user to explicitly specify the
>>>>>> export ID?
>>>>>>
>>>>>
>>>>> No, it'd be fine to autogenerate those in some fashion.
>>>>>
>>>>>>> We'd also want to add a way to remove and enumerate exports. Maybe:
>>>>>>>
>>>>>>>     $ ceph nfsexport ls
>>>>>>>     $ ceph nfsexport rm --id=100
>>>>>>>
>>>>>>> So the create command above would create an object called "export-100"
>>>>>>> in the given rados_pool/rados_namespace. 
>>>>>>>
>>>>>>> From there, we'd need to also be able to "link" and "unlink" these
>>>>>>> export objects into the config files for each daemon. So if I have a
>>>>>>> cluster of 2 servers with nodeids "a" and "b":
>>>>>>>
>>>>>>>     $ ceph nfsexport link --pool=mypool			\
>>>>>>> 			--namespace=mynamespace		\
>>>>>>> 			--id=100 			\
>>>>>>> 			--node=a			\
>>>>>>> 			--node=b
>>>>>>>
>>>>>>> ...with a corresponding "unlink" command. That would append objects
>>>>>>> called "conf-a" and "conf-b" with this line:
>>>>>>>
>>>>>>>     %url rados://mypool/mynamespace/export-100
>>>>>>>
>>>>>>> ...and then call into the orchestrator to send a SIGHUP to the daemons
>>>>>>> to make them pick up the new configs. We might also want to sanity check
>>>>>>> whether any conf-* files are still linked to the export-* files before
>>>>>>> removing those objects.
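[Editor's note: the link/unlink operations described above amount to appending or removing a single %url line in each conf-<nodeid> object. As pure text manipulation that is roughly the following; function names are hypothetical:]

```python
def link_export(conf_text: str, pool: str, namespace: str, export_id: int) -> str:
    """Append a %url include for export-<id> to a config, if not present."""
    url = "%%url rados://%s/%s/export-%d" % (pool, namespace, export_id)
    if url in conf_text.splitlines():
        return conf_text  # already linked; nothing to do
    return conf_text + url + "\n"

def unlink_export(conf_text: str, pool: str, namespace: str, export_id: int) -> str:
    """Remove the %url include for export-<id> from a config."""
    url = "%%url rados://%s/%s/export-%d" % (pool, namespace, export_id)
    kept = [line for line in conf_text.splitlines() if line.strip() != url]
    return "".join(line + "\n" for line in kept)
```

(After rewriting the object, the daemons would still need a SIGHUP via the orchestrator to pick up the change, as described above.)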
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> I got a bit lost with this link/unlink part. In the current dashboard
>>>>>> implementation, when we create an export the implementation will add the
>>>>>> export configuration into the rados://<pool>/<namespace>/conf-<nodeid>
>>>>>> object and call the orchestrator to update/restart the service.
>>>>>>
>>>>>> It looks to me that you are separating the export creation from the
>>>>>> export deployment. First you create the export, and then you add it to
>>>>>> the service configuration.
>>>>>>
>>>>>> We can also implement this two-step behavior in the dashboard, and in
>>>>>> the dashboard Web UI we can add a checkbox where the user specifies
>>>>>> whether to apply the new export right away.
>>>>>>
>>>>>> In the dashboard, we will also implement a "copy" command to copy an
>>>>>> export configuration to another ganesha server. That will help with
>>>>>> creating similar exports in different servers.
>>>>>> Another option would be, instead of having a single "<hostname>" field
>>>>>> in the create export function, to accept a list of <hostname>s.
>>>>>>
>>>>>
>>>>> The two-step process was not so much for immediacy as to eliminate the
>>>>> need to replicate all of the EXPORT blocks across a potentially large
>>>>> series of objects. If we (e.g.) needed to modify a CLIENT block to allow
>>>>> a new subnet to have access, we'd only need to change the one object and
>>>>> then SIGHUP all of the daemons.
>>>>>
>>>>> That said, if replicating those blocks across multiple objects is
>>>>> simpler then we'll adapt.
>>>>
>>>> After thinking more about this, I think the approach you suggest of
>>>> using an object for each export and then linking it into the servers'
>>>> configurations makes more sense, and avoids the need for a "Copy Export"
>>>> operation.
>>>>
>>>> Since you will also consume the dashboard REST API besides the dashboard
>>>> frontend, I'll open a PR with just the backend implementation, so that
>>>> it can be merged quickly without waiting for the frontend to be ready.
>>>>
>>>>
>>>
>>> Thanks! I spent some time playing around with your branch and I think it
>>> looks like it'll work just fine for us.
>>>
>>> Just a few notes for anyone else that wants to do this. As Ricardo
>>> mentioned separately, RADOS namespace support is not yet plumbed in, so
>>> I'm using a separate "nfs-ganesha" pool here to house the RADOS objects
>>> needed for ganesha's configs and recovery backend:
>>>
>>> $ MON=3 OSD=1 MGR=1 MDS=1 ../src/vstart.sh -n
>>>
>>> $ ganesha-rados-grace -p nfs-ganesha `hostname`
>>>
>>> $ rados create -p nfs-ganesha conf-`hostname`
>>>
>>> $ ceph dashboard ganesha-host-add `hostname` rados://nfs-ganesha/conf-`hostname`
>>>
>>> $  ./run-backend-api-request.sh POST /api/nfs-ganesha/export "`cat ~/export.json`"
>>>
>>> ...where export.json is something like:
>>>
>>> -----------------[snip]------------------
>>> {
>>>      "hostname": "server_hostname",
>>>      "path": "/foo",
>>>      "fsal": {"name": "CEPH", "user_id":"admin", "fs_name": "myfs"},
>>>      "pseudo": "/foo",
>>>      "tag": null,
>>>      "access_type": "RW",
>>>      "squash": "no_root_squash",
>>>      "protocols": [4],
>>>      "transports": ["TCP"],
>>>      "clients": [{
>>>        "addresses":["10.0.0.0/8"],
>>>        "access_type": "RO",
>>>        "squash": "root"
>>>      }]
>>> }
>>> -----------------[snip]-------------------
>>>
>>> This creates an object with a ganesha config EXPORT block which looks
>>> valid. We may need to tweak it a bit, but I think this should work just
>>> fine.
>>
>> Thanks for posting the above steps!
>>
>>> I know the web UI is still pretty raw, but here are some comments
>>> anyway:
>>>
>>> For safety reasons, the default Access Type should probably be "RO", and
>>> the default Squash mode should be "Root" or maybe even "All". You may
>>> also want to somehow ensure that the admin consciously decides to export
>>> to the world instead of making that the default when no client is
>>> specified.
>>
>> This is very valuable information. I have never administered an NFS
>> ganesha server and therefore don't have experience with what the defaults
>> should be. Thanks for the suggestions.
>>
> 
> No problem. We definitely want this to be a "safe by default" design, as
> much as possible. Getting exports wrong is a great way to compromise
> security in some environments.
> 
>>> It'd be nice to be able to granularly select the NFSv4 minorversions. If
>>> you exclusively have NFSv4.1+ clients, then the grace period can be
>>> lifted early after a restart. That's a big deal for continuous
>>> operation. In our clustered configurations, we plan to not support
>>> anything before v4.1 by default.
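[Editor's note: in ganesha the served minor versions are controlled in the NFSV4 config block. A fragment restricting service to v4.1/v4.2 might look like the following; check the ganesha documentation for the exact parameter name and defaults:]

```
NFSV4 {
    # serve only NFSv4.1 and NFSv4.2; v4.0 clients are refused,
    # allowing the grace period to be lifted early after a restart
    Minor_Versions = 1, 2;
}
```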
>>
>> I didn't know about the existence of minorversions. Where can I get the
>> list of all possible values for the protocol version?
>>
> 
> Those are all governed by the IETF RFCs. Basically we have v4.0, v4.1
> and v4.2 so far, and I wouldn't worry about anything beyond that at this
> point.
> 
> We may eventually end up with a v4.3, but we're sort of moving to a
> model that is based on feature flags so that may not ever materialize.
> 
>>> We'll probably need some way to specify the fsal.user_id field in the UI
>>> too. Maybe a dropdown box that enumerates the available principals?
>>
>> Yes, and that's already been done by Tiago Melo (tmelo on IRC) in his
>> development branch. I believe he has added a dropdown with the list of
>> cephx users.
>>
> 
> Nice.
> 
>>> That's all for now. I think what I'll probably do is close out my PR to
>>> add NFS support to the orchestrator and concentrate on wiring the rook
>>> orchestrator into what you have, since it's more complete.
>>>
> 
> FWIW... after I took a closer look, I think the PR I had to add NFS
> support to the orchestrator is mostly orthogonal to your changes, so I
> think we'll probably want to merge the latest version of it after all.

Yes, I agree.

-- 
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284
(AG Nürnberg)


