Re: how should we manage ganesha's export tables from ceph-mgr ?

On 08/01/19 14:13, Jeff Layton wrote:
> On Fri, 2019-01-04 at 12:35 +0000, Ricardo Dias wrote:
>>
>> On 04/01/19 12:05, Jeff Layton wrote:
>>> On Fri, 2019-01-04 at 10:35 +0000, Ricardo Dias wrote:
>>>> On 03/01/19 19:05, Jeff Layton wrote:
>>>>> Ricardo,
>>>>>
>>>>> We chatted earlier about a new ceph mgr module that would spin up EXPORT
>>>>> blocks for ganesha and stuff them into a RADOS object. Here's basically
>>>>> what we're aiming for. I think it's pretty similar to what SuSE's
>>>>> solution is doing so I think it'd be good to collaborate here.
>>>>
>>>> Just to make things clearer: we (SUSE) didn't write a
>>>> downstream-specific implementation. The implementation we developed
>>>> targets the upstream ceph-dashboard code.
>>>>
>>>> The dashboard backend code to manage ganesha exports is almost done. We
>>>> haven't opened a PR yet because we are still finishing the frontend
>>>> code, which might cause the backend to change a bit.
>>>>
>>>> The current code is located here:
>>>> https://github.com/rjfd/ceph/tree/wip-dashboard-nfs
>>>>
>>>>> Probably I should write this up in a doc somewhere, but here's what I'd
>>>>> envision. First an overview:
>>>>>
>>>>> The Rook.io ceph+ganesha CRD basically spins up nfs-ganesha pods under
>>>>> k8s that don't export anything by default and have a fairly stock
>>>>> config. Each ganesha daemon that is started has a boilerplate config
>>>>> file that ends with a %url include like this:
>>>>>
>>>>>     %url rados://<pool>/<namespace>/conf-<nodeid>
>>>>>
>>>>> The nodeid in this case is the unique nodeid within a cluster of ganesha
>>>>> servers using the rados_cluster recovery backend in ganesha. Rook
>>>>> enumerates these starting with 'a' and going through 'z' (and then 'aa',
>>>>> 'ab', etc.). So node 'a' would have a config object called "conf-a".
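>>>>>
>>>>> Roughly, the boilerplate config each daemon starts with would look
>>>>> something like the sketch below (the pool/namespace names are just
>>>>> placeholders):
>>>>>
>>>>>     NFSv4 {
>>>>>         # rados_cluster keeps per-client recovery state in RADOS
>>>>>         RecoveryBackend = rados_cluster;
>>>>>     }
>>>>>
>>>>>     RADOS_KV {
>>>>>         pool = "mypool";
>>>>>         namespace = "mynamespace";
>>>>>         nodeid = "a";
>>>>>     }
>>>>>
>>>>>     %url rados://mypool/mynamespace/conf-a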
>>>>>
>>>>
>>>> This was the same assumption we made, and the current implementation
>>>> code can manage the exports of different servers (configuration objects).
>>>>
>>>>> What we currently lack is the code to set up those conf-<nodeid>
>>>>> objects. I know you have some code to do this sort of configuration via
>>>>> the dashboard and a REST API. Would it make more sense to split this bit
>>>>> out into a separate module, which would also allow it to be usable from
>>>>> the command line?
>>>>
>>>> Yes, and no :) I think the benefit of splitting the code into a separate
>>>> module is that other mgr modules could manage ganesha exports through
>>>> the mgr "_remote" call infrastructure, and that someone could manage
>>>> ganesha exports without enabling the dashboard module.
>>>>
>>>> Regarding CLI commands, since the dashboard code exposes the export
>>>> management through a REST API, we can always use curl to call it
>>>> (although it will be a more verbose command).
>>>>
>>>> In the dashboard source directory we have a small bash script to help
>>>> calling the REST API from the CLI. Here's an example of an export
>>>> creation using the current implementation:
>>>>
>>>> $ ./run-backend-api-request.sh POST /api/nfs-ganesha/export \
>>>>   '{
>>>>      "hostname": "node1.domain",
>>>>      "path": "/foo",
>>>>      "fsal": {"name": "CEPH", "user_id": "admin", "fs_name": "myfs"},
>>>>      "pseudo": "/foo",
>>>>      "tag": null,
>>>>      "access_type": "RW",
>>>>      "squash": "no_root_squash",
>>>>      "protocols": [4],
>>>>      "transports": ["TCP"],
>>>>      "clients": [{
>>>>        "addresses": ["10.0.0.0/8"],
>>>>        "access_type": "RO",
>>>>        "squash": "root"
>>>>      }]}'
>>>>
>>>> The JSON fields and structure are similar to the ganesha export
>>>> configuration structure.
>>>>
>>>> We also have other commands:
>>>>
>>>> # list all exports
>>>> $ ./run-backend-api-request.sh GET /api/nfs-ganesha/export
>>>>
>>>> # get an export
>>>> $ ./run-backend-api-request.sh GET \
>>>> 	/api/nfs-ganesha/export/<hostname>/<id>
>>>>
>>>> # update an export
>>>> $ ./run-backend-api-request.sh PUT \
>>>> 	/api/nfs-ganesha/export/<hostname>/<id> <json string>
>>>>
>>>> # delete an export
>>>> $ ./run-backend-api-request.sh DELETE \
>>>> 	/api/nfs-ganesha/export/<hostname>/<id>
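>>>>
>>>> For example, getting a single export returns something like the
>>>> following (this is just a sketch from memory -- the exact response
>>>> fields may still change):
>>>>
>>>> $ ./run-backend-api-request.sh GET \
>>>> 	/api/nfs-ganesha/export/node1.domain/1
>>>> {"export_id": 1, "hostname": "node1.domain", "path": "/foo", ...}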
>>>>
>>>>
>>>> In the dashboard implementation, the server configuration is identified
>>>> by the <hostname> field, which does not need to be a real hostname.
>>>> The dashboard keeps a map between the hostname and the rados object URL
>>>> that stores the configuration of the server.
>>>>
>>>
>>> Ok, that all sounds fine, actually. I think we can probably live with
>>> the REST API for this.
>>>
>>> It might be good to rename the "hostname" field to something more
>>> generic (maybe nodeid). The rados_cluster recovery backend for ganesha
>>> requires a unique nodeid for each node. If it's not specified then it
>>> will use the hostname.
>>
>> Sounds good to me.
>>
>>>> The bootstrap of this host/rados_url map can be done in two ways:
>>>> a) automatically: when an orchestrator backend is available, the
>>>> dashboard asks the orchestrator for this information.
>>>> b) manually: the dashboard provides some CLI commands to add this
>>>> information. Example:
>>>> $  ceph dashboard ganesha-host-add <hostname> <rados_url>
>>>>
>>>
>>> I'll have to think about this bit.
>>>
>>> The main use case we're interested in currently is Openstack Manila:
>>>
>>>     https://wiki.openstack.org/wiki/Manila
>>>
>>> It has its own REST API, and admins can request new servers to be
>>> started or volumes to be created and exported.
>>>
>>> What I had envisioned was that requests to manila would get translated
>>> into requests to do things like:
>>>
>>> create volumes and subvolumes
>>> ask the orchestrator to spin up a new daemon
>>> modify the conf-* objects and ask the orchestrator to send daemons a
>>> SIGHUP
>>>
>>> I think what you're proposing should be fine there. Probably I just need
>>> to pull down your wip branch and play with it to better understand.
>>
>> I think everything should work in the above use case if the dashboard is
>> using the orchestrator. After the orchestrator spins up the new daemon,
>> the dashboard will have access to the new daemon configuration without
>> manual intervention.
>>
>>>>> My thinking was that we'd probably want to create a new mgr module for
>>>>> that, and could wire it up to the command line with something like:
>>>>>
>>>>>     $ ceph nfsexport create --id=100			\
>>>>> 			--pool=mypool			\
>>>>> 			--namespace=mynamespace		\
>>>>> 			--type=cephfs			\
>>>>> 			--volume=myfs			\
>>>>> 			--subvolume=/foo		\
>>>>> 			--pseudo=/foo			\
>>>>> 			--cephx_userid=admin		\
>>>>> 			--cephx_key=<base64 key>	\
>>>>> 			--client=10.0.0.0/8,ro,root	\
>>>>> 			--client=admhost,rw,none
>>>>>
>>>>> ...the "client" is a string that would be a tuple of client access
>>>>> string, r/o or r/w, and the userid squashing mode, and could be
>>>>> specified multiple times.
>>>>
>>>> The above command is similar to what we provide in the REST API, with
>>>> the difference that the dashboard generates the export ID.
>>>>
>>>> Do you think it is important for the user to explicitly specify the
>>>> export ID?
>>>>
>>>
>>> No, it'd be fine to autogenerate those in some fashion.
>>>
>>>>> We'd also want to add a way to remove and enumerate exports. Maybe:
>>>>>
>>>>>     $ ceph nfsexport ls
>>>>>     $ ceph nfsexport rm --id=100
>>>>>
>>>>> So the create command above would create an object called "export-100"
>>>>> in the given rados_pool/rados_namespace. 
>>>>>
>>>>> From there, we'd need to also be able to "link" and "unlink" these
>>>>> export objects into the config files for each daemon. So if I have a
>>>>> cluster of 2 servers with nodeids "a" and "b":
>>>>>
>>>>>     $ ceph nfsexport link --pool=mypool			\
>>>>> 			--namespace=mynamespace		\
>>>>> 			--id=100 			\
>>>>> 			--node=a			\
>>>>> 			--node=b
>>>>>
>>>>> ...with a corresponding "unlink" command. That would append objects
>>>>> called "conf-a" and "conf-b" with this line:
>>>>>
>>>>>     %url rados://mypool/mynamespace/export-100
>>>>>
>>>>> ...and then call into the orchestrator to send a SIGHUP to the daemons
>>>>> to make them pick up the new configs. We might also want to sanity check
>>>>> whether any conf-* files are still linked to the export-* files before
>>>>> removing those objects.
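>>>>>
>>>>> After linking, the conf-a and conf-b objects would then just contain
>>>>> one %url line per linked export, e.g. (export-101 here is a
>>>>> hypothetical second export):
>>>>>
>>>>>     %url rados://mypool/mynamespace/export-100
>>>>>     %url rados://mypool/mynamespace/export-101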
>>>>>
>>>>> Thoughts?
>>>>
>>>> I got a bit lost with this link/unlink part. In the current dashboard
>>>> implementation, when we create an export, the implementation adds the
>>>> export configuration to the rados://<pool>/<namespace>/conf-<nodeid>
>>>> object and calls the orchestrator to update/restart the service.
>>>>
>>>> It looks to me that you are separating the export creation from the
>>>> export deployment. First you create the export, and then you add it to
>>>> the service configuration.
>>>>
>>>> We can also implement this two-step behavior in the dashboard, and in
>>>> the dashboard Web UI we can have a checkbox where the user can specify
>>>> whether to apply the new export right away.
>>>>
>>>> In the dashboard, we will also implement a "copy" command to copy an
>>>> export configuration to another ganesha server. That will help with
>>>> creating similar exports in different servers.
>>>> Another option would be, instead of having a single "<hostname>" field
>>>> in the create-export call, to accept a list of <hostname>s.
>>>>
>>>
>>> The two-step process was not so much for immediacy as to eliminate the
>>> need to replicate all of the EXPORT blocks across a potentially large
>>> series of objects. If we (e.g.) needed to modify a CLIENT block to allow
>>> a new subnet to have access, we'd only need to change the one object and
>>> then SIGHUP all of the daemons.
>>>
>>> That said, if replicating those blocks across multiple objects is
>>> simpler then we'll adapt.
>>
>> After thinking more about this, I think the approach you suggest, of
>> using an object for each export and then linking it to the servers'
>> configurations, makes more sense and avoids the need for a "Copy Export"
>> operation.
>>
>> Since the dashboard REST API will also be consumed by you, and not just
>> by the dashboard frontend, I'll open a PR with just the backend
>> implementation so that it can be merged quickly, without waiting for the
>> frontend to be ready.
>>
>>
> 
> Thanks! I spent some time playing around with your branch and I think it
> looks like it'll work just fine for us.
> 
> Just a few notes for anyone else that wants to do this. As Ricardo
> mentioned separately, RADOS namespace support is not yet plumbed in, so
> I'm using a separate "nfs-ganesha" pool here to house the RADOS objects
> needed for ganesha's configs and recovery backend:
> 
> $ MON=3 OSD=1 MGR=1 MDS=1 ../src/vstart.sh -n
> 
> $ ganesha-rados-grace -p nfs-ganesha `hostname`
> 
> $ rados create -p nfs-ganesha conf-`hostname`
> 
> $ ceph dashboard ganesha-host-add `hostname` rados://nfs-ganesha/conf-`hostname`
> 
> $  ./run-backend-api-request.sh POST /api/nfs-ganesha/export "`cat ~/export.json`"
> 
> ...where export.json is something like:
> 
> -----------------[snip]------------------
> {
>      "hostname": "server_hostname",
>      "path": "/foo",
>      "fsal": {"name": "CEPH", "user_id":"admin", "fs_name": "myfs"},
>      "pseudo": "/foo",
>      "tag": null,
>      "access_type": "RW",
>      "squash": "no_root_squash",
>      "protocols": [4],
>      "transports": ["TCP"],
>      "clients": [{
>        "addresses":["10.0.0.0/8"],
>        "access_type": "RO",
>        "squash": "root"
>      }]
> }
> -----------------[snip]-------------------
> 
> This creates an object with a ganesha config EXPORT block which looks
> valid. We may need to tweak it a bit, but I think this should work just
> fine.
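> 
> For reference, the generated EXPORT block for the JSON above comes out
> roughly like this (transcribed by hand, so treat it as a sketch rather
> than the exact output):
> 
> EXPORT {
>     Export_ID = 1;
>     Path = "/foo";
>     Pseudo = "/foo";
>     Access_Type = RW;
>     Squash = No_Root_Squash;
>     Protocols = 4;
>     Transports = TCP;
> 
>     FSAL {
>         Name = CEPH;
>         User_Id = "admin";
>         Filesystem = "myfs";
>     }
> 
>     CLIENT {
>         Clients = 10.0.0.0/8;
>         Access_Type = RO;
>         Squash = Root;
>     }
> }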

Thanks for posting the above steps!

> 
> I know the web UI is still pretty raw, but here are some comments
> anyway:
> 
> For safety reasons, the default Access Type should probably be "RO", and
> the default Squash mode should be "Root" or maybe even "All". You may
> also want to somehow ensure that the admin consciously decides to export
> to the world instead of making that the default when no client is
> specified.

This is very valuable information. I have never administered an NFS
Ganesha server, and therefore don't have experience with what the
defaults should be. Thanks for the suggestions.

> 
> It'd be nice to be able to granularly select the NFSv4 minorversions. If
> you exclusively have NFSv4.1+ clients, then the grace period can be
> lifted early after a restart. That's a big deal for continuous
> operation. In our clustered configurations, we plan to not support
> anything before v4.1 by default.
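> 
> (For reference, ganesha selects minor versions via the minor_versions
> setting in its NFSv4 config block, e.g.:
> 
>     NFSv4 {
>         # allow only NFSv4.1 and NFSv4.2 mounts
>         minor_versions = 1, 2;
>     }
> 
> ...so that's the knob a UI selector would map to.)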

I didn't know about the existence of minorversions. Where can I get the
list of all possible values for the protocol version?

> 
> We'll probably need some way to specify the fsal.user_id field in the UI
> too. Maybe a dropdown box that enumerates the available principals?

Yes, and that's already been done by Tiago Melo (tmelo on IRC) in his
development branch. I believe he has added a dropdown with the list of
cephx users.

> 
> That's all for now. I think what I'll probably do is close out my PR to
> add NFS support to the orchestrator and concentrate on wiring the rook
> orchestrator into what you have, since it's more complete.
> 
> Cheers!
> 

-- 
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284
(AG Nürnberg)

