Re: [PATCH v2 2/6] libceph: introduce reference counted string

> On Mar 23, 2016, at 00:13, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> 
> On Tue, Mar 22, 2016 at 2:57 PM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>> 
>>> On Mar 22, 2016, at 19:00, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> 
>>> On Tue, Mar 22, 2016 at 10:17 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>> 
>>>>> On Mar 22, 2016, at 14:05, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> 
>>> [ snip ]
>>> 
>>>>> Hi Zheng,
>>>>> 
>>>>> A one and a half line commit message and an equally short cover letter
>>>>> for a series such as this isn't enough.  I *happen* to know that the
>>>>> basic use case for namespaces in cephfs is going to be restricting
>>>>> users to different parts of the directory tree, with the enforcement
>>>>> happening in ceph on the server side, as opposed to in ceph on the
>>>>> client side, but I would appreciate some details on what the actual
>>>>> namespace names are going to be, whether it's user input or not,
>>>>> whether there are plans to use namespaces for anything else, etc.
>>>>> 
>>>> 
>>>> The namespace restriction you mentioned is for cephfs metadata. This
>>>> namespace restricts users to different namespaces in the cephfs data
>>>> pool. (At present, the only way to restrict data access is creating
>>>> multiple cephfs data pools and restricting users to different data
>>>> pools. Creating lots of pools is not efficient for the cluster.)
>>> 
>>> What about the namespace names, who is generating them, how long are
>>> they going to be?  Please describe in detail how this is going to work
>>> for both data and metadata pools.
>> 
>> For example, to restrict user foo to directory /foo_dir
>> 
>> // create auth caps for user foo.
>> ceph auth get-or-create client.foo mon 'allow r' mds 'allow r, allow rw path=/foo_dir' osd 'allow rw pool=data namespace=foo_ns'
>> 
>> // mount cephfs by user admin
>> mount -t ceph 10.0.0.1:/ /mnt/ceph_mount -o name=admin,secret=xxxx
>> 
>> // set directory’s layout.pool_namespace
>> setfattr -n ceph.dir.pool_namespace -v foo_ns /mnt/ceph_mount/foo_dir
>> 
>> The admin user chooses the namespace name. In most cases, the namespace name does not change.
> 
> Good, I guess limiting it to 100 chars (or maybe even a smaller
> number) is sensible then.
> 
>> 
>>> 
>>>> 
>>>> 
>>>>> I don't like the idea of sharing namespace name strings between libceph
>>>>> and ceph modules, especially with the strings themselves hosted in
>>>>> libceph.  rbd has no use for namespaces, so libceph can live with
>>>>> namespace names embedded into ceph_osd_request by value or with
>>>>> a simple non-owning pointer, leaving reference counting to the outside
>>>>> modules, if one of the use cases is "one namespace with a long name for
>>>>> the entire directory tree" or something close to it.
>>>>> 
>>>>> I think the sharing infrastructure should be moved into cephfs, and
>>>>> probably changed to share entire file layouts along the way.  I don't
>>>>> know this code well enough to be sure, but it seems that by sharing
>>>>> file layouts and making ci->i_layout an owning pointer you might be
>>>>> able to piggy back on i_ceph_lock and considerably simplify the whole
>>>>> thing by dropping RCU and eliminating numerous get/put calls.
>>>> 
>>>> RBD may use namespace later.
>>>> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_namespace_support
>>> 
>>> Well, compared to cephfs, it's hard to call that "using" - in that
>>> case, there will only ever be one namespace per image.  My point is
>>> it's never going to use the string sharing infrastructure and is fine
>>> with a non-owning pointer to a string in the file layout field of the
>>> in-memory rbd image header.
>>> 
>>>> 
>>>> The reason I use RCU here is that ci->i_layout.pool_ns can change at
>>>> any time. For the same reason, using a non-owning pointer for the
>>>> namespace or the entire layout is infeasible. Using an embedded
>>>> namespace is not elegant either: when allocating a ceph_osd_request,
>>>> cephfs needs to lock i_ceph_lock, copy the namespace to a temporary
>>>> buffer, unlock i_ceph_lock, and pass ci->i_layout and the temporary
>>>> buffer to ceph_osdc_xxx_request().
>>> 
>>> Hmm, RCU doesn't protect you from pool_ns or other file layout fields
>>> changing while the OSD request is in flight.  As used above, it allows
>>> ceph_try_get_string() to not take any locks and that's it.
>> 
>> Yes.  But not taking the lock simplifies the code a lot.  We don't
>> need to lock/unlock i_ceph_lock each time i_layout is used.
> 
> I wouldn't call a bunch of rcu_dereference_* variants sprinkled around
> the code base a simplification, but, more importantly, is keeping the
> pool_ns pointer valid really all you need?  Shouldn't there be some
> kind of synchronization around "OK, I'm switching to a new layout for
> this inode"?  As it is, pool_ns is grabbed in ceph_osdc_new_request(),
> with two successive calls to ceph_osdc_new_request() potentially ending
> up with two different namespaces, e.g. ceph_uninline_data().

There is synchronisation. When changing a file layout, the MDS revokes Frw caps from the client (blocking new reads/writes and waiting for in-progress reads/writes to finish). But this synchronisation is not completely reliable when the client session state toggles between stale and active.
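
Regarding the "rcu_dereference sprinkled around" point: callers are not supposed to touch rcu_dereference() directly, only the get/put helpers. A simplified sketch of the pattern (not the exact patch code; the real version also interns the strings in an rb-tree so equal namespace names can share one allocation):

#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct ceph_string {
	struct kref kref;
	struct rcu_head rcu;
	size_t len;
	char str[];
};

/* Lockless get: kref_get_unless_zero() makes it safe against a
 * concurrent final put. */
static inline struct ceph_string *
ceph_try_get_string(struct ceph_string __rcu **pstr)
{
	struct ceph_string *str;

	rcu_read_lock();
	str = rcu_dereference(*pstr);
	if (str && !kref_get_unless_zero(&str->kref))
		str = NULL;
	rcu_read_unlock();
	return str;
}

static void ceph_string_release(struct kref *kref)
{
	struct ceph_string *str = container_of(kref, struct ceph_string, kref);

	/* Defer the free until readers that may still be looking at the
	 * pointer leave their RCU read-side sections. */
	kfree_rcu(str, rcu);
}

static inline void ceph_put_string(struct ceph_string *str)
{
	if (str)
		kref_put(&str->kref, ceph_string_release);
}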

>>> 
>>> Why exactly can't file layouts be shared?  ci->i_layout would become
>>> reference counted and you would give libceph a pointer to the pool_ns
>>> string (or entire layout) after bumping it.  It doesn't matter if
>>> pool_ns or the rest of the layout changes due to a cap grant or revoke
>>> while libceph is servicing the OSD request: you would unlink it from
>>> the tree but the bumped reference will keep the layout around, to be
>>> put in the OSD request completion callback or so.  Layout lookup would
>>> have to happen in exactly two places: when newing an inode and handling
>>> cap grant/revoke, in other places you would simply bump the count on
>>> the basis of already holding a valid pointer.  You wouldn't have to
>>> unlink in the destructor, so no hassle with kref_get_unless_zero() and
>>> no need for RCU, with i_ceph_lock providing the exclusion around the
>>> tree.
>> 
>> This means cephfs needs to set r_callback for every ceph_osd_request,
>> and ceph_osd_request also needs a new field to store the layout
>> pointer. I don’t think it’s better/simpler than a reference counted
>> namespace string.
> 
> Not necessarily - you can put after ceph_osdc_wait_request() returns.
> Somewhat unrelated, I'm working on refactoring osdc's handle_reply(),
> and it'll probably be required that all OSD requests set one of the
> callbacks, except for stateless fire-and-forget ones.

For the r_callback case (the no-wait case), without saving a pointer in the ceph_osd_request, how can I know which layout to put?
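
To make that concrete, here is a hypothetical sketch (the shared layout type and put helper are made up for illustration, and the callback prototype is simplified; r_priv is assumed to be free for cephfs to use as the stash):

#include <linux/ceph/osd_client.h>

/* Hypothetical: whatever per-inode layout object cephfs would share. */
struct cephfs_shared_layout;
void cephfs_put_shared_layout(struct cephfs_shared_layout *layout);

/* With layout sharing, every async request has to carry the layout
 * pointer and set a completion callback whose extra job is the put. */
static void cephfs_osd_req_complete(struct ceph_osd_request *req)
{
	struct cephfs_shared_layout *layout = req->r_priv;

	/* ... normal completion handling ... */

	cephfs_put_shared_layout(layout);
}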

> 
> Sharing ->i_layout as opposed to ->i_layout->pool_ns seemed sensible to
> me because a) it naturally hangs off of ceph inode and b) logically,
> it is entire layouts and not just namespaces that are shared across the
> directory tree.  If you think reference counted pool_ns strings are
> better, I won't argue with that, but, with cephfs being the only user
> of either solution, it'll have to live in fs/ceph.

I’m OK with both approaches. When sharing i_layout, we need to add a layout pointer to ceph_osd_request. Once that pointer is there, why not let libceph release it when the request finishes?
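
A sketch of what I mean (the r_pool_ns field is hypothetical, whether it ends up being the namespace string or a whole layout pointer; the rest is the existing teardown path, heavily simplified):

/* libceph owns the reference: whoever builds the request takes it once,
 * and the normal request teardown drops it, so cephfs needs neither a
 * wait nor a dedicated callback just for the put. */
static void ceph_osdc_release_request(struct kref *kref)
{
	struct ceph_osd_request *req =
		container_of(kref, struct ceph_osd_request, r_kref);

	ceph_put_string(req->r_pool_ns);	/* hypothetical new field, NULL-safe */

	/* ... existing message/page cleanup and freeing of req ... */
}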

> Separately, I think passing non-owning pool_ns pointers into libceph is
> worth exploring, but if that doesn't easily map onto cephfs lifetime or
> ownership rules, we will go with embedding namespace names by value into
> ceph_osd_request (or, rather, ceph_object_locator).
> 

As I stated in a previous mail, an embedded namespace is a nightmare for cephfs. Every time the namespace is used, cephfs needs to lock i_ceph_lock, copy the namespace to a temporary buffer, and unlock i_ceph_lock again.
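
Roughly, every I/O path ends up doing something like the following (pool_ns_data, pool_ns_len and the buffer handling are made-up names for the sketch; i_ceph_lock is the real per-inode lock from fs/ceph/super.h):

/* Illustrative only: the copy dance that an embedded, by-value
 * namespace forces on cephfs before every request. */
static int cephfs_copy_pool_ns(struct ceph_inode_info *ci,
			       char *buf, size_t buflen, size_t *lenp)
{
	spin_lock(&ci->i_ceph_lock);
	if (ci->i_layout.pool_ns_len > buflen) {	/* hypothetical field */
		spin_unlock(&ci->i_ceph_lock);
		return -ERANGE;
	}
	*lenp = ci->i_layout.pool_ns_len;
	memcpy(buf, ci->i_layout.pool_ns_data, *lenp);	/* hypothetical field */
	spin_unlock(&ci->i_ceph_lock);

	/* The caller then passes buf/*lenp down to ceph_osdc_new_request(),
	 * which has to copy the bytes yet again into the request. */
	return 0;
}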

Regards
Yan, Zheng

> Thanks,
> 
>                Ilya



