> On Mar 24, 2016, at 18:26, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Thu, Mar 24, 2016 at 3:33 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>
>>> On Mar 24, 2016, at 03:27, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>
>>> On Wed, Mar 23, 2016 at 2:02 PM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>
>>>>> On Mar 23, 2016, at 17:51, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>
>>>>> On Wed, Mar 23, 2016 at 4:37 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> On Mar 23, 2016, at 00:13, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Tue, Mar 22, 2016 at 2:57 PM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> On Mar 22, 2016, at 19:00, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Mar 22, 2016 at 10:17 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mar 22, 2016, at 14:05, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> [ snip ]
>>>>>>>>>
>>>>>>>>>>> Hi Zheng,
>>>>>>>>>>>
>>>>>>>>>>> A one and a half line commit message and an equally short cover letter
>>>>>>>>>>> for a series such as this isn't enough. I *happen* to know that the
>>>>>>>>>>> basic use case for namespaces in cephfs is going to be restricting
>>>>>>>>>>> users to different parts of the directory tree, with the enforcement
>>>>>>>>>>> happening in ceph on the server side, as opposed to in ceph on the
>>>>>>>>>>> client side, but I would appreciate some details on what the actual
>>>>>>>>>>> namespace names are going to be, whether it's user input or not,
>>>>>>>>>>> whether there are plans to use namespaces for anything else, etc.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The namespace restriction you mentioned is for cephfs metadata. This
>>>>>>>>>> namespace restricts users to different namespaces in the cephfs data
>>>>>>>>>> pool. (At present, the only way to restrict data access is to create
>>>>>>>>>> multiple cephfs data pools and restrict users to different data pools.
>>>>>>>>>> Creating lots of pools is not efficient for the cluster.)
>>>>>>>>>
>>>>>>>>> What about the namespace names, who is generating them, how long are
>>>>>>>>> they going to be? Please describe in detail how this is going to work
>>>>>>>>> for both data and metadata pools.
>>>>>>>>
>>>>>>>> For example, to restrict user foo to directory /foo_dir:
>>>>>>>>
>>>>>>>> // create auth caps for user foo.
>>>>>>>> ceph auth get-or-create client.foo mon 'allow r' mds 'allow r, allow rw path=/foo_dir' osd 'allow rw pool=data namespace=foo_ns'
>>>>>>>>
>>>>>>>> // mount cephfs by user admin
>>>>>>>> mount -t ceph 10.0.0.1:/ /mnt/ceph_mount -o name=admin,secret=xxxx
>>>>>>>>
>>>>>>>> // set the directory's layout.pool_namespace
>>>>>>>> setfattr -n ceph.dir.pool_namespace -v foo_ns /mnt/ceph_mount/foo_dir
>>>>>>>>
>>>>>>>> The admin user chooses the namespace name. In most cases, the namespace name does not change.
>>>>>>>
>>>>>>> Good, I guess limiting it to 100 chars (or maybe even a smaller
>>>>>>> number) is sensible then.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I don't like the idea of sharing namespace name strings between libceph
>>>>>>>>>>> and ceph modules, especially with the strings themselves hosted in
>>>>>>>>>>> libceph.
>>>>>>>>>>> rbd has no use for namespaces, so libceph can live with
>>>>>>>>>>> namespace names embedded into ceph_osd_request by value or with
>>>>>>>>>>> a simple non-owning pointer, leaving reference counting to the outside
>>>>>>>>>>> modules, if one of the use cases is "one namespace with a long name for
>>>>>>>>>>> the entire directory tree" or something close to it.
>>>>>>>>>>>
>>>>>>>>>>> I think the sharing infrastructure should be moved into cephfs, and
>>>>>>>>>>> probably changed to share entire file layouts along the way. I don't
>>>>>>>>>>> know this code well enough to be sure, but it seems that by sharing
>>>>>>>>>>> file layouts and making ci->i_layout an owning pointer you might be
>>>>>>>>>>> able to piggy back on i_ceph_lock and considerably simplify the whole
>>>>>>>>>>> thing by dropping RCU and eliminating numerous get/put calls.
>>>>>>>>>>
>>>>>>>>>> RBD may use namespaces later.
>>>>>>>>>> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_namespace_support
>>>>>>>>>
>>>>>>>>> Well, compared to cephfs, it's hard to call that "using" - in that
>>>>>>>>> case, there will only ever be one namespace per image. My point is
>>>>>>>>> it's never going to use the string sharing infrastructure and is fine
>>>>>>>>> with a non-owning pointer to a string in the file layout field of the
>>>>>>>>> in-memory rbd image header.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The reason I use RCU here is that ci->i_layout.pool_ns can change at
>>>>>>>>>> any time. For the same reason, using a non-owning pointer for the
>>>>>>>>>> namespace or the entire layout is not feasible. Using an embedded
>>>>>>>>>> namespace is not elegant either. When allocating a ceph_osd_request,
>>>>>>>>>> cephfs needs to lock i_ceph_lock, copy the namespace to a temporary
>>>>>>>>>> buffer, unlock i_ceph_lock, and pass ci->i_layout and the temporary
>>>>>>>>>> buffer to ceph_osdc_xxx_request().
>>>>>>>>>
>>>>>>>>> Hmm, RCU doesn't protect you from pool_ns or other file layout fields
>>>>>>>>> changing while the OSD request is in flight. As used above, it allows
>>>>>>>>> ceph_try_get_string() to not take any locks and that's it.
>>>>>>>>
>>>>>>>> Yes. But not taking locks simplifies the code a lot. We don't need to
>>>>>>>> lock/unlock i_ceph_lock each time i_layout is used.
>>>>>>>
>>>>>>> I wouldn't call a bunch of rcu_dereference_* variants sprinkled around
>>>>>>> the code base a simplification, but, more importantly, is keeping the
>>>>>>> pool_ns pointer valid really all you need? Shouldn't there be some
>>>>>>> kind of synchronization around "OK, I'm switching to a new layout for
>>>>>>> this inode"? As it is, pool_ns is grabbed in ceph_osdc_new_request(),
>>>>>>> with two successive calls to ceph_osdc_new_request() potentially ending
>>>>>>> up with two different namespaces, e.g. ceph_uninline_data().
>>>>>>
>>>>>> There is synchronisation. When changing the file layout, the MDS revokes Frw caps from the client (block new read/write, for in-progress read/write). But this synchronisation is not completely reliable when the client session state is toggled between stale and active.
>>>>>
>>>>> I have a bit of trouble parsing "block new read/write, for in-progress
>>>>> read/write". So the client will stop issuing requests as soon as it
>>>>> learns that it no longer has a cap, but what happens with the in-flight
>>>>> requests?
>>>>
>>>> When the client knows the MDS is revoking Frw caps, it stops issuing new requests and waits for the in-flight requests. After all in-flight requests complete, the client releases the Frw caps to the MDS.
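
For reference, the lockless get/put pattern being discussed here is
roughly the following. This is only a simplified sketch with made-up
names (ns_string, ns_try_get, ns_put), not the actual ceph_string code
from the series:

#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct ns_string {
	struct kref kref;
	struct rcu_head rcu;
	char str[];			/* NUL-terminated namespace name */
};

/* lockless lookup: succeeds only if the refcount has not already hit zero */
static struct ns_string *ns_try_get(struct ns_string __rcu **pns)
{
	struct ns_string *ns;

	rcu_read_lock();
	ns = rcu_dereference(*pns);
	if (ns && !kref_get_unless_zero(&ns->kref))
		ns = NULL;		/* raced with the final put */
	rcu_read_unlock();
	return ns;
}

static void ns_release(struct kref *kref)
{
	struct ns_string *ns = container_of(kref, struct ns_string, kref);

	kfree_rcu(ns, rcu);		/* actual free happens after a grace period */
}

static void ns_put(struct ns_string *ns)
{
	if (ns)
		kref_put(&ns->kref, ns_release);
}

The writer installing a new namespace would rcu_assign_pointer() the
new string and drop its reference to the old one; because the final
kref_put() frees the memory only after a grace period, a concurrent
ns_try_get() can never touch freed memory.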
>>>>
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> Why exactly can't file layouts be shared? ci->i_layout would become
>>>>>>>>> reference counted and you would give libceph a pointer to the pool_ns
>>>>>>>>> string (or entire layout) after bumping it. It doesn't matter if
>>>>>>>>> pool_ns or the rest of the layout changes due to a cap grant or revoke
>>>>>>>>> while libceph is servicing the OSD request: you would unlink it from
>>>>>>>>> the tree but the bumped reference will keep the layout around, to be
>>>>>>>>> put in the OSD request completion callback or so. Layout lookup would
>>>>>>>>> have to happen in exactly two places: when newing an inode and handling
>>>>>>>>> cap grant/revoke, in other places you would simply bump the count on
>>>>>>>>> the basis of already holding a valid pointer. You wouldn't have to
>>>>>>>>> unlink in the destructor, so no hassle with kref_get_unless_zero() and
>>>>>>>>> no need for RCU, with i_ceph_lock providing the exclusion around the
>>>>>>>>> tree.
>>>>>>>>
>>>>>>>> This means cephfs needs to set r_callback for all ceph_osd_requests,
>>>>>>>> and ceph_osd_request also needs a new field to store the layout pointer.
>>>>>>>> I don’t think it's better/simpler than a reference counted namespace
>>>>>>>> string.
>>>>>>>
>>>>>>> Not necessarily - you can put after ceph_osdc_wait_request() returns.
>>>>>>> Somewhat unrelated, I'm working on refactoring osdc's handle_reply(),
>>>>>>> and it'll probably be required that all OSD requests set one of the
>>>>>>> callbacks, except for stateless fire-and-forget ones.
>>>>>>
>>>>>> For the r_callback case (the no-wait case), without saving a pointer in ceph_osd_request, how can I know which layout to put?
>>>>>>
>>>>>>>
>>>>>>> Sharing ->i_layout as opposed to ->i_layout->pool_ns seemed sensible to
>>>>>>> me because a) it naturally hangs off of ceph inode and b) logically,
>>>>>>> it is entire layouts and not just namespaces that are shared across the
>>>>>>> directory tree. If you think reference counted pool_ns strings are
>>>>>>> better, I won't argue with that, but, with cephfs being the only user
>>>>>>> of either solution, it'll have to live in fs/ceph.
>>>>>>
>>>>>> I’m OK with both approaches. When sharing i_layout, we need to add a layout pointer to ceph_osd_request. After adding the layout pointer, why not let libceph release it when the request finishes?
>>>>>>
>>>>>>> Separately, I think passing non-owning pool_ns pointers into libceph is
>>>>>>> worth exploring, but if that doesn't easily map onto cephfs lifetime or
>>>>>>> ownership rules, we will go with embedding namespace names by value into
>>>>>>> ceph_osd_request (or, rather, ceph_object_locator).
>>>>>>>
>>>>>>
>>>>>> As I stated in my previous mail, an embedded namespace is a nightmare for cephfs. Every time the namespace is used, cephfs needs to lock i_ceph_lock and copy the namespace to a temporary buffer.
>>>>>
>>>>> So you are maintaining that all that is needed is to keep the memory
>>>>> valid and there is no locking around installing a new namespace for an
>>>>> inode. I didn't realize that when I suggested layout sharing; it makes
>>>>> it much less attractive.
>>>>
>>>> Yes. That’s the main reason I decided to use RCU.
>>>
>>> For the record, I don't think it's a good design and I doubt the
>>> implementation is going to work reliably, but that's your call.
>>>
>>> Why would embedded namespaces in ceph_object_locator in libceph be
>>> a nightmare for cephfs? What do you refer to as a temporary buffer?
>>> This kind of copying already occurs: you grab a ceph_string with
>>> ceph_try_get_string() in ceph_osdc_new_request() and it's copied into
>>> the request message, as part of encoding. How is grabbing ceph_string
>>> before calling into libceph and explicitly copying into object locator
>>> different?
>>
>> I mean not using a reference-counted ceph_string.
>>
>>
>> Yes, we can make cephfs code manage the reference counting, calling ceph_try_get_string() and ceph_put_string() each time we issue an osd request. But this approach requires more code/overhead for both ceph and libceph (an extra parameter for the ceph_osdc_xxx_request() functions, code that copies the namespace to the embedded buffer, lots of ceph_try_get_string()/ceph_put_string() pairs in cephfs code). All of this is just for removing one ceph_try_get_string() call and two ceph_put_string() calls in libceph. Is it worth the effort, and does it make the libceph code cleaner?
>
> It'll be printed into the request after it's allocated, just like OID
> is now, so e.g. ceph_osdc_alloc_request() won't need any changes. You
> may need to patch ceph_osdc_new_request(), but that's a cephfs-specific
> function and should probably be moved out of libceph. You won't need
> to add lots of get/put pairs in cephfs either - a single helper around
> filling in an object locator would do it.
>
> The two disadvantages (extra strcpy() and larger OSD requests) are
> real, but so are the benefits of modularity and keeping object locator
> a simple value type. I'm just trying to explore all available options.

I agree that it's easy for us to explore both options if we move ceph_osdc_new_request(), ceph_osdc_readpage and ceph_osdc_writepage into the fs/ceph directory.

>
> TBH I still can't wrap my head around lockless layout/namespace pointer
> updates and how that can ever work reliably…

See Documentation/RCU/rcuref.txt. The way RCU is used in my code is similar to the second example there.

Regards
Yan, Zheng

> Thanks,
>
> Ilya
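
P.S. If we do end up embedding the namespace by value, the "single
helper around filling in an object locator" you describe would
presumably look something like the sketch below. All names here
(fake_oloc, fake_inode_info, fake_fill_oloc, FAKE_NS_MAX) are
placeholders of mine, not the real libceph or cephfs structures, and
the spinlock simply stands in for whatever ends up protecting the
current layout (i_ceph_lock in our case):

#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/types.h>

#define FAKE_NS_MAX	100			/* assumed cap on namespace length */

struct fake_oloc {
	s64 pool;
	char ns[FAKE_NS_MAX + 1];		/* namespace embedded by value */
};

struct fake_inode_info {
	spinlock_t lock;			/* stands in for i_ceph_lock */
	s64 pool;
	char ns[FAKE_NS_MAX + 1];		/* current layout namespace */
};

/* the "single helper" idea: copy pool + namespace into the locator once */
static void fake_fill_oloc(struct fake_oloc *oloc, struct fake_inode_info *ci)
{
	spin_lock(&ci->lock);
	oloc->pool = ci->pool;
	strlcpy(oloc->ns, ci->ns, sizeof(oloc->ns));
	spin_unlock(&ci->lock);
}

cephfs would call such a helper once, right before handing the request
to libceph, so the extra copy happens once per request and libceph
never needs to know where the string came from or how long it lives.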