> On Mar 24, 2016, at 03:27, Ilya Dryomov <idryomov@xxxxxxxxx> wrote: > > On Wed, Mar 23, 2016 at 2:02 PM, Yan, Zheng <zyan@xxxxxxxxxx> wrote: >> >>> On Mar 23, 2016, at 17:51, Ilya Dryomov <idryomov@xxxxxxxxx> wrote: >>> >>> On Wed, Mar 23, 2016 at 4:37 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote: >>>> >>>>> On Mar 23, 2016, at 00:13, Ilya Dryomov <idryomov@xxxxxxxxx> wrote: >>>>> >>>>> On Tue, Mar 22, 2016 at 2:57 PM, Yan, Zheng <zyan@xxxxxxxxxx> wrote: >>>>>> >>>>>>> On Mar 22, 2016, at 19:00, Ilya Dryomov <idryomov@xxxxxxxxx> wrote: >>>>>>> >>>>>>> On Tue, Mar 22, 2016 at 10:17 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote: >>>>>>>> >>>>>>>>> On Mar 22, 2016, at 14:05, Ilya Dryomov <idryomov@xxxxxxxxx> wrote: >>>>>>> >>>>>>> [ snip ] >>>>>>> >>>>>>>>> Hi Zheng, >>>>>>>>> >>>>>>>>> A one and a half line commit message and an equally short cover letter >>>>>>>>> for a series such as this isn't enough. I *happen* to know that the >>>>>>>>> basic use case for namespaces in cephfs is going to be restricting >>>>>>>>> users to different parts of the directory tree, with the enforcement >>>>>>>>> happening in ceph on the server side, as opposed to in ceph on the >>>>>>>>> client side, but I would appreciate some details on what the actual >>>>>>>>> namespace names are going to be, whether it's user input or not, >>>>>>>>> whether there are plans to use namespaces for anything else, etc. >>>>>>>>> >>>>>>>> >>>>>>>> The namespace restriction you mentioned is for cephfs metadata. This >>>>>>>> namespace is restricting users to different namespaces in cephfs data >>>>>>>> pool. (At present, the only way to restrict data access is, creating >>>>>>>> multiple cephfs data pools, restrict users to different data pool. >>>>>>>> Creating lost of pools is not efficient for the cluster) >>>>>>> >>>>>>> What about the namespace names, who is generating them, how long are >>>>>>> they going to be? Please describe in detail how this is going to work >>>>>>> for both data and metadata pools. >>>>>> >>>>>> For example, to restrict user foo to directory /foo_dir >>>>>> >>>>>> // create auth caps for user foo. >>>>>> ceph auth get-or-create client.foo mon 'allow r' mds 'allow r, allow rw path=/foo_dir' osd 'allow rw pool=data namespace=foo_ns’ >>>>>> >>>>>> // mount cephfs by user admin >>>>>> mount -t ceph 10.0.0.1:/ /mnt/ceph_mount -o name=admin,secret=xxxx >>>>>> >>>>>> // set directory’s layout.pool_namespace >>>>>> setfattr -n ceph.dir.pool_namespace -v foo_ns /mnt/ceph_mount/foo_dir >>>>>> >>>>>> Admin user chooses namespace name. In most cases, namespace name does not change. >>>>> >>>>> Good, I guess limiting it to 100 chars (or maybe even a smaller >>>>> number) is sensible then. >>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> I don't like the idea of sharing namespace name strings between libceph >>>>>>>>> and ceph modules, especially with the strings themselves hosted in >>>>>>>>> libceph. rbd has no use for namespaces, so libceph can live with >>>>>>>>> namespace names embedded into ceph_osd_request by value or with >>>>>>>>> a simple non-owning pointer, leaving reference counting to the outside >>>>>>>>> modules, if one of the use cases is "one namespace with a long name for >>>>>>>>> the entire directory tree" or something close to it. >>>>>>>>> >>>>>>>>> I think the sharing infrastructure should be moved into cephfs, and >>>>>>>>> probably changed to share entire file layouts along the way. I don't >>>>>>>>> know this code well enough to be sure, but it seems that by sharing >>>>>>>>> file layouts and making ci->i_layout an owning pointer you might be >>>>>>>>> able to piggy back on i_ceph_lock and considerably simplify the whole >>>>>>>>> thing by dropping RCU and eliminating numerous get/put calls. >>>>>>>> >>>>>>>> RBD may use namespace later. >>>>>>>> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_namespace_support >>>>>>> >>>>>>> Well, compared to cephfs, it's hard to call that "using" - in that >>>>>>> case, there will only ever be one namespace per image. My point is >>>>>>> it's never going to use the string sharing infrastructure and is fine >>>>>>> with a non-owning pointer to a string in the file layout field of the >>>>>>> in-memory rbd image header. >>>>>>> >>>>>>>> >>>>>>>> The reason I use RCU here is that ci->i_layout.pool_ns can change at >>>>>>>> any time. For the same reason, using non-owning pointer for namespace >>>>>>>> or entire layout is unfeasible. Using embedded namespace is not >>>>>>>> elegant either. When allocating ceph_osd_request, cephfs needs to >>>>>>>> lock i_ceph_lock, copy namespace to a temporary buffer, unlock >>>>>>>> i_ceph_lock, pass ci->i_layout and the temporary buffer to the >>>>>>>> ceph_osdc_xxx_request(). >>>>>>> >>>>>>> Hmm, RCU doesn't protect you from pool_ns or other file layout fields >>>>>>> changing while the OSD request is in flight. As used above, it allows >>>>>>> ceph_try_get_string() to not take any locks and that's it. >>>>>> >>>>>> Yes. But not taking lock simplify the code a lot. we don't need to >>>>>> lock/unlock i_ceph_lock each time i_layout is used. >>>>> >>>>> I wouldn't call a bunch of rcu_dereference_* variants sprinkled around >>>>> the code base a simplification, but, more importantly, is keeping the >>>>> pool_ns pointer valid really all you need? Shouldn't there be some >>>>> kind of synchronization around "OK, I'm switching to a new layout for >>>>> this inode"? As it is, pool_ns is grabbed in ceph_osdc_new_request(), >>>>> with two successive calls to ceph_osdc_new_request() potentially ending >>>>> up with two different namespaces, e.g. ceph_uninline_data(). >>>> >>>> There is synchronisation. When changing file layout, MDS revokes Frw caps from client (block new read/write, for in-progress read/write). But this synchronisation is not complete reliable when client session state is toggled between stale and active. >>> >>> I have a bit of trouble parsing "block new read/write, for in-progress >>> read/write". So the client will stop issuing requests as soon as it >>> learns that it no longer has a cap, but what happens with the in-flight >>> requests? >> >> When client know MDS is revoking Frw caps, it stops issuing new request and waits for in-flight requests. After all in-flight requests completes, client releases Frw caps to MDS. >> >>> >>>> >>>>>>> >>>>>>> Why exactly can't file layouts be shared? ci->i_layout would become >>>>>>> reference counted and you would give libceph a pointer to the pool_ns >>>>>>> string (or entire layout) after bumping it. It doesn't matter if >>>>>>> pool_ns or the rest of the layout changes due to a cap grant or revoke >>>>>>> while libceph is servicing the OSD request: you would unlink it from >>>>>>> the tree but the bumped reference will keep the layout around, to be >>>>>>> put in the OSD request completion callback or so. Layout lookup would >>>>>>> have to happen in exactly two places: when newing an inode and handling >>>>>>> cap grant/revoke, in other places you would simply bump the count on >>>>>>> the basis of already holding a valid pointer. You wouldn't have to >>>>>>> unlink in the destructor, so no hassle with kref_get_unless_zero() and >>>>>>> no need for RCU, with i_ceph_lock providing the exclusion around the >>>>>>> tree. >>>>>> >>>>>> This means cephfs needs to set r_callback for all ceph_osd_request, >>>>>> ceph_osd_request also needs a new field to store layout pointer. >>>>>> I don’t think it’s better/simpler than reference counted namespace >>>>>> string. >>>>> >>>>> Not necessarily - you can put after ceph_osdc_wait_request() returns. >>>>> Somewhat unrelated, I'm working on refactoring osdc's handle_reply(), >>>>> and it'll probably be required that all OSD requests set one of the >>>>> callbacks, except for stateless fire-and-forget ones. >>>> >>>> For the r_callback case (no wait case), without saving a pointer in ceph_osd_request, how can I know which layout to put? >>>> >>>>> >>>>> Sharing ->i_layout as opposed to ->i_layout->pool_ns seemed sensible to >>>>> me because a) it naturally hangs off of ceph inode and b) logically, >>>>> it is entire layouts and not just namespaces that are shared across the >>>>> directory tree. If you think reference counted pool_ns strings are >>>>> better, I won't argue with that, but, with cephfs being the only user >>>>> of either solution, it'll have to live in fs/ceph. >>>> >>>> I’m OK with both approaches. When sharing i_layout, we need to add a layout pointer to ceph_osd_request. After adding the layout pointer, why not let libceph release it when request finishes. >>>> >>>>> Separately, I think passing non-owning pool_ns pointers into libceph is >>>>> worth exploring, but if that doesn't easily map onto cephfs lifetime or >>>>> ownership rules, we will go with embedding namespace names by value into >>>>> ceph_osd_request (or, rather, ceph_object_locator). >>>>> >>>> >>>> As I stated in previous mail, embedded namespace is nightmare for cephfs. Every time namespace is used, cephfs needs to lock i_ceph_lock, copy namespace to a temporary buffer. >>> >>> So you are maintaining that all that is needed is to keep the memory >>> valid and there is no locking around installing a new namespace for an >>> inode. I didn't realize that when I suggested layout sharing, it makes >>> it much less attractive. >> >> Yes. That’s the main reason I decided to use RCU. > > For the record, I don't think it's a good design and I doubt the > implementation is going to work reliably, but that's your call. > > Why would embedded namespaces in ceph_object_locator in libceph be > a nightmare for cephfs? What do you refer to as a temporary buffer? > This kind of copying already occurs: you grab a ceph_string with > ceph_try_get_string() in ceph_osdc_new_request() and it's copied into > the request message, as part of encoding. How is grabbing ceph_string > before calling into libceph and explicitly copying into object locator > different? I mean not using reference-counted ceph_string. Yes, we can make cephfs code manage the reference counting, call ceph_try_get_string() and ceph_put_string() each time we issue osd request. But this approach requires more code/overhead for both ceph and libceph (an extra parameter for ceph_osdc_xxx_request() functions, code that copies namespace to embedded buffer, lots of ceph_try_get_string()/ceph_put_string() pair in cephfs code). All of this is just for removing one ceph_try_get_string() call and two ceph_put_string() call in libceph. does it worth the effort, does it make libceph code cleaner? Regards Yan,Zheng-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html