On 10/01/19 15:30, Jeff Layton wrote:
> On Wed, 2019-01-09 at 23:37 +0000, Ricardo Dias wrote:
>> On 08/01/19 14:13, Jeff Layton wrote:
>>> On Fri, 2019-01-04 at 12:35 +0000, Ricardo Dias wrote:
>>>> On 04/01/19 12:05, Jeff Layton wrote:
>>>>> On Fri, 2019-01-04 at 10:35 +0000, Ricardo Dias wrote:
>>>>>> On 03/01/19 19:05, Jeff Layton wrote:
>>>>>>> Ricardo,
>>>>>>>
>>>>>>> We chatted earlier about a new ceph mgr module that would spin up
>>>>>>> EXPORT blocks for ganesha and stuff them into a RADOS object.
>>>>>>> Here's basically what we're aiming for. I think it's pretty
>>>>>>> similar to what SuSE's solution is doing, so I think it'd be good
>>>>>>> to collaborate here.
>>>>>>
>>>>>> Just to make things more clear, we (SUSE) didn't implement a
>>>>>> specific downstream solution. The implementation we developed
>>>>>> targets the upstream ceph-dashboard code.
>>>>>>
>>>>>> The dashboard backend code to manage ganesha exports is almost
>>>>>> done. We still haven't opened a PR because we are still finishing
>>>>>> the frontend code, which might make the backend change a bit.
>>>>>>
>>>>>> The current code is located here:
>>>>>> https://github.com/rjfd/ceph/tree/wip-dashboard-nfs
>>>>>>
>>>>>>> Probably I should write this up in a doc somewhere, but here's
>>>>>>> what I'd envision. First, an overview:
>>>>>>>
>>>>>>> The Rook.io ceph+ganesha CRD basically spins up nfs-ganesha pods
>>>>>>> under k8s that don't export anything by default and have a fairly
>>>>>>> stock config. Each ganesha daemon that is started has a
>>>>>>> boilerplate config file that ends with a %url include like this:
>>>>>>>
>>>>>>> %url rados://<pool>/<namespace>/conf-<nodeid>
>>>>>>>
>>>>>>> The nodeid in this case is the unique nodeid within a cluster of
>>>>>>> ganesha servers using the rados_cluster recovery backend in
>>>>>>> ganesha. Rook enumerates these starting with 'a' and going
>>>>>>> through 'z' (and then 'aa', 'ab', etc.). So node 'a' would have a
>>>>>>> config object called "conf-a".
>>>>>>
>>>>>> This was the same assumption we made, and the current
>>>>>> implementation can manage the exports of different servers
>>>>>> (configuration objects).
>>>>>>
>>>>>>> What we currently lack is the code to set up those conf-<nodeid>
>>>>>>> objects. I know you have some code to do this sort of
>>>>>>> configuration via the dashboard and a REST API. Would it make
>>>>>>> more sense to split this bit out into a separate module, which
>>>>>>> would also allow it to be usable from the command line?
>>>>>>
>>>>>> Yes, and no :) I think the benefit of splitting the code into a
>>>>>> separate module is that other mgr modules could manage ganesha
>>>>>> exports using the mgr "_remote" call infrastructure, and that
>>>>>> exports could be managed without enabling the dashboard module.
>>>>>>
>>>>>> Regarding CLI commands, since the dashboard code exposes export
>>>>>> management through a REST API, we can always use curl to call it
>>>>>> (although it will be a more verbose command).
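>>>>>>
>>>>>> For illustration, such a curl call might look something like the
>>>>>> following (the hostname, port, and authentication details here are
>>>>>> made up and will depend on the deployment):
>>>>>>
>>>>>> $ curl -k -H "Authorization: Bearer $TOKEN" \
>>>>>>     https://mgr-host:8443/api/nfs-ganesha/export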
>>>>>>
>>>>>> In the dashboard source directory we have a small bash script to
>>>>>> help with calling the REST API from the CLI. Here's an example of
>>>>>> an export creation using the current implementation:
>>>>>>
>>>>>> $ ./run-backend-api-request.sh POST /api/nfs-ganesha/export \
>>>>>> '{
>>>>>>    "hostname": "node1.domain",
>>>>>>    "path": "/foo",
>>>>>>    "fsal": {"name": "CEPH", "user_id": "admin", "fs_name": "myfs"},
>>>>>>    "pseudo": "/foo",
>>>>>>    "tag": null,
>>>>>>    "access_type": "RW",
>>>>>>    "squash": "no_root_squash",
>>>>>>    "protocols": [4],
>>>>>>    "transports": ["TCP"],
>>>>>>    "clients": [{
>>>>>>      "addresses": ["10.0.0.0/8"],
>>>>>>      "access_type": "RO",
>>>>>>      "squash": "root"
>>>>>>    }]
>>>>>> }'
>>>>>>
>>>>>> The JSON fields and structure are similar to the ganesha export
>>>>>> configuration structure.
>>>>>>
>>>>>> We also have other commands:
>>>>>>
>>>>>> # list all exports
>>>>>> $ ./run-backend-api-request.sh GET /api/nfs-ganesha/export
>>>>>>
>>>>>> # get an export
>>>>>> $ ./run-backend-api-request.sh GET \
>>>>>>     /api/nfs-ganesha/export/<hostname>/<id>
>>>>>>
>>>>>> # update an export
>>>>>> $ ./run-backend-api-request.sh PUT \
>>>>>>     /api/nfs-ganesha/export/<hostname>/<id> <json string>
>>>>>>
>>>>>> # delete an export
>>>>>> $ ./run-backend-api-request.sh DELETE \
>>>>>>     /api/nfs-ganesha/export/<hostname>/<id>
>>>>>>
>>>>>> In the dashboard implementation, the server configuration is
>>>>>> identified by the <hostname> field, which does not need to be a
>>>>>> real hostname. The dashboard keeps a map between the hostname and
>>>>>> the RADOS object URL that stores the configuration of the server.
>>>>>
>>>>> Ok, that all sounds fine, actually. I think we can probably live
>>>>> with the REST API for this.
>>>>>
>>>>> It might be good to rename the "hostname" field to something more
>>>>> generic (maybe nodeid). The rados_cluster recovery backend for
>>>>> ganesha requires a unique nodeid for each node. If it's not
>>>>> specified then it will use the hostname.
>>>>
>>>> Sounds good to me.
>>>>
>>>>>> The bootstrap of this host/rados_url map can be done in two ways:
>>>>>>
>>>>>> a) automatically: when an orchestrator backend is available, the
>>>>>>    dashboard asks the orchestrator for this information.
>>>>>> b) manually: the dashboard provides some CLI commands to add this
>>>>>>    information. Example:
>>>>>>    $ ceph dashboard ganesha-host-add <hostname> <rados_url>
>>>>>
>>>>> I'll have to think about this bit.
>>>>>
>>>>> The main use case we're interested in currently is OpenStack
>>>>> Manila:
>>>>>
>>>>> https://wiki.openstack.org/wiki/Manila
>>>>>
>>>>> It has its own REST API, and admins can request new servers to be
>>>>> started or volumes to be created and exported.
>>>>>
>>>>> What I had envisioned was that requests to manila would get
>>>>> translated into requests to do things like:
>>>>>
>>>>> create volumes and subvolumes
>>>>> ask the orchestrator to spin up a new daemon
>>>>> modify the conf-* objects and ask the orchestrator to send daemons
>>>>> a SIGHUP
>>>>>
>>>>> I think what you're proposing should be fine there. Probably I just
>>>>> need to pull down your wip branch and play with it to better
>>>>> understand.
>>>>
>>>> I think it should all work in the above use case if the dashboard is
>>>> using the orchestrator. After the orchestrator spins up the new
>>>> daemon, the dashboard will have access to the new daemon
>>>> configuration without manual intervention.
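>>>>
>>>> For anyone following along, the per-daemon bootstrap config that
>>>> gets laid down would look roughly like this (a sketch; the pool name
>>>> and nodeid are examples, and the exact set of options may differ):
>>>>
>>>> NFSv4 {
>>>>     # use the clustered RADOS recovery backend discussed above
>>>>     RecoveryBackend = rados_cluster;
>>>> }
>>>>
>>>> RADOS_KV {
>>>>     pool = "nfs-ganesha";
>>>>     nodeid = "a";
>>>> }
>>>>
>>>> # per-node config object maintained by the dashboard/mgr
>>>> %url rados://nfs-ganesha/conf-a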
>>>>
>>>>>>> My thinking was that we'd probably want to create a new mgr
>>>>>>> module for that, and could wire it up to the command line with
>>>>>>> something like:
>>>>>>>
>>>>>>> $ ceph nfsexport create --id=100 \
>>>>>>>         --pool=mypool \
>>>>>>>         --namespace=mynamespace \
>>>>>>>         --type=cephfs \
>>>>>>>         --volume=myfs \
>>>>>>>         --subvolume=/foo \
>>>>>>>         --pseudo=/foo \
>>>>>>>         --cephx_userid=admin \
>>>>>>>         --cephx_key=<base64 key> \
>>>>>>>         --client=10.0.0.0/8,ro,root \
>>>>>>>         --client=admhost,rw,none
>>>>>>>
>>>>>>> ...the "client" is a string that would be a tuple of client
>>>>>>> access string, r/o or r/w, and the userid squashing mode, and
>>>>>>> could be specified multiple times.
>>>>>>
>>>>>> The above command is similar to what we provide in the REST API,
>>>>>> with the difference that the dashboard generates the export ID.
>>>>>>
>>>>>> Do you think it is important for the user to explicitly specify
>>>>>> the export ID?
>>>>>
>>>>> No, it'd be fine to autogenerate those in some fashion.
>>>>>
>>>>>>> We'd also want to add a way to remove and enumerate exports.
>>>>>>> Maybe:
>>>>>>>
>>>>>>> $ ceph nfsexport ls
>>>>>>> $ ceph nfsexport rm --id=100
>>>>>>>
>>>>>>> So the create command above would create an object called
>>>>>>> "export-100" in the given rados_pool/rados_namespace.
>>>>>>>
>>>>>>> From there, we'd need to also be able to "link" and "unlink"
>>>>>>> these export objects into the config files for each daemon. So if
>>>>>>> I have a cluster of 2 servers with nodeids "a" and "b":
>>>>>>>
>>>>>>> $ ceph nfsexport link --pool=mypool \
>>>>>>>         --namespace=mynamespace \
>>>>>>>         --id=100 \
>>>>>>>         --node=a \
>>>>>>>         --node=b
>>>>>>>
>>>>>>> ...with a corresponding "unlink" command. That would append this
>>>>>>> line to the objects called "conf-a" and "conf-b":
>>>>>>>
>>>>>>> %url rados://mypool/mynamespace/export-100
>>>>>>>
>>>>>>> ...and then call into the orchestrator to send a SIGHUP to the
>>>>>>> daemons to make them pick up the new configs. We might also want
>>>>>>> to sanity check whether any conf-* files are still linked to the
>>>>>>> export-* files before removing those objects.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> I got a bit lost with this link/unlink part. In the current
>>>>>> dashboard implementation, when we create an export, the
>>>>>> implementation adds the export configuration to the
>>>>>> rados://<pool>/<namespace>/conf-<nodeid> object and calls the
>>>>>> orchestrator to update/restart the service.
>>>>>>
>>>>>> It looks to me that you are separating the export creation from
>>>>>> the export deployment. First you create the export, and then you
>>>>>> add it to the service configuration.
>>>>>>
>>>>>> We can also implement this two-step behavior in the dashboard, and
>>>>>> in the dashboard Web UI we can have a checkbox where the user can
>>>>>> choose whether to apply the new export right away or not.
>>>>>>
>>>>>> In the dashboard, we will also implement a "copy" command to copy
>>>>>> an export configuration to another ganesha server. That will help
>>>>>> with creating similar exports in different servers. Another option
>>>>>> would be, instead of having a single "<hostname>" field in the
>>>>>> create export function, to have a list of <hostname>s.
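>>>>>>
>>>>>> If I understand the model correctly, after linking export 100 to
>>>>>> nodes "a" and "b", each node's config object would just gain that
>>>>>> one include line. Something like this (a sketch, reusing the
>>>>>> pool/namespace names from your example):
>>>>>>
>>>>>> $ rados -p mypool -N mynamespace get conf-a -
>>>>>> %url rados://mypool/mynamespace/export-100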
>>>>>
>>>>> The two-step process was not so much for immediacy as to eliminate
>>>>> the need to replicate all of the EXPORT blocks across a potentially
>>>>> large series of objects. If we (e.g.) needed to modify a CLIENT
>>>>> block to allow a new subnet to have access, we'd only need to
>>>>> change the one object and then SIGHUP all of the daemons.
>>>>>
>>>>> That said, if replicating those blocks across multiple objects is
>>>>> simpler, then we'll adapt.
>>>>
>>>> After thinking more about this, I think the approach you suggest of
>>>> using an object for each export and then linking it to the servers'
>>>> configuration makes more sense, and avoids the need for a "Copy
>>>> Export" operation.
>>>>
>>>> Since you will also consume the dashboard REST API besides the
>>>> dashboard frontend, I'll open a PR with just the backend
>>>> implementation, so that it can be merged quickly without waiting for
>>>> the frontend to be ready.
>>>
>>> Thanks! I spent some time playing around with your branch and I think
>>> it looks like it'll work just fine for us.
>>>
>>> Just a few notes for anyone else that wants to do this. As Ricardo
>>> mentioned separately, RADOS namespace support is not yet plumbed in,
>>> so I'm using a separate "nfs-ganesha" pool here to house the RADOS
>>> objects needed for ganesha's configs and recovery backend:
>>>
>>> $ MON=3 OSD=1 MGR=1 MDS=1 ../src/vstart.sh -n
>>>
>>> $ ganesha-rados-grace -p nfs-ganesha `hostname`
>>>
>>> $ rados create -p nfs-ganesha conf-`hostname`
>>>
>>> $ ceph dashboard ganesha-host-add `hostname` \
>>>       rados://nfs-ganesha/conf-`hostname`
>>>
>>> $ ./run-backend-api-request.sh POST /api/nfs-ganesha/export \
>>>       "`cat ~/export.json`"
>>>
>>> ...where export.json is something like:
>>>
>>> -----------------[snip]------------------
>>> {
>>>   "hostname": "server_hostname",
>>>   "path": "/foo",
>>>   "fsal": {"name": "CEPH", "user_id": "admin", "fs_name": "myfs"},
>>>   "pseudo": "/foo",
>>>   "tag": null,
>>>   "access_type": "RW",
>>>   "squash": "no_root_squash",
>>>   "protocols": [4],
>>>   "transports": ["TCP"],
>>>   "clients": [{
>>>     "addresses": ["10.0.0.0/8"],
>>>     "access_type": "RO",
>>>     "squash": "root"
>>>   }]
>>> }
>>> -----------------[snip]------------------
>>>
>>> This creates an object with a ganesha config EXPORT block which looks
>>> valid. We may need to tweak it a bit, but I think this should work
>>> just fine.
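>>>
>>> For reference, the generated EXPORT block should look roughly like
>>> the following (a sketch reconstructed from the JSON above; treat the
>>> exact spelling of the options as approximate):
>>>
>>> EXPORT {
>>>     export_id = 1;
>>>     path = "/foo";
>>>     pseudo = "/foo";
>>>     access_type = "RW";
>>>     squash = "no_root_squash";
>>>     protocols = 4;
>>>     transports = "TCP";
>>>
>>>     FSAL {
>>>         name = "CEPH";
>>>         user_id = "admin";
>>>         filesystem = "myfs";
>>>     }
>>>
>>>     CLIENT {
>>>         clients = 10.0.0.0/8;
>>>         access_type = "RO";
>>>         squash = "root";
>>>     }
>>> }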
>>
>> Thanks for posting the above steps!
>>
>>> I know the web UI is still pretty raw, but here are some comments
>>> anyway:
>>>
>>> For safety reasons, the default Access Type should probably be "RO",
>>> and the default Squash mode should be "Root" or maybe even "All". You
>>> may also want to somehow ensure that the admin consciously decides to
>>> export to the world instead of making that the default when no client
>>> is specified.
>>
>> This is very valuable information. I have never administered an NFS
>> ganesha server and therefore don't have experience with what the
>> defaults should be. Thanks for the suggestions.
>
> No problem. We definitely want this to be a "safe by default" design,
> as much as possible. Getting exports wrong is a great way to compromise
> security in some environments.
>
>>> It'd be nice to be able to granularly select the NFSv4
>>> minorversions. If you exclusively have NFSv4.1+ clients, then the
>>> grace period can be lifted early after a restart. That's a big deal
>>> for continuous operation. In our clustered configurations, we plan
>>> to not support anything before v4.1 by default.
>>
>> I didn't know about the existence of minorversions. Where can I get
>> the list of all possible values for the protocol version?
>
> Those are all governed by the IETF RFCs. Basically we have v4.0, v4.1
> and v4.2 so far, and I wouldn't worry about anything beyond that at
> this point.
>
> We may eventually end up with a v4.3, but we're sort of moving to a
> model that is based on feature flags so that may not ever materialize.
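>
> In ganesha config terms, limiting a server to v4.1+ should just be a
> matter of something like this (double-check the option name against
> your ganesha version):
>
> NFSv4 {
>     # serve only NFSv4.1 and v4.2; drop v4.0
>     minor_versions = 1, 2;
> }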
>
>>> We'll probably need some way to specify the fsal.user_id field in
>>> the UI too. Maybe a dropdown box that enumerates the available
>>> principals?
>>
>> Yes, and that's already been done by Tiago Melo (tmelo on IRC) in his
>> development branch. I believe he has added a dropdown with the list of
>> cephx users.
>
> Nice.
>
>>> That's all for now. I think what I'll probably do is close out my PR
>>> to add NFS support to the orchestrator and concentrate on wiring the
>>> rook orchestrator into what you have, since it's more complete.
>
> FWIW... after I took a closer look, I think the PR I had to add NFS
> support to the orchestrator is mostly orthogonal to your changes, so I
> think we'll probably want to merge the latest version of it after all.

Yes, I agree.

-- 
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)