Re: [PATCH v2 2/6] libceph: introduce reference counted string

On Tue, Mar 22, 2016 at 2:57 PM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>
>> On Mar 22, 2016, at 19:00, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>
>> On Tue, Mar 22, 2016 at 10:17 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>
>>>> On Mar 22, 2016, at 14:05, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>
>> [ snip ]
>>
>>>> Hi Zheng,
>>>>
>>>> A one and a half line commit message and an equally short cover letter
>>>> for a series such as this isn't enough.  I *happen* to know that the
>>>> basic use case for namespaces in cephfs is going to be restricting
>>>> users to different parts of the directory tree, with the enforcement
>>>> happening in ceph on the server side, as opposed to in ceph on the
>>>> client side, but I would appreciate some details on what the actual
>>>> namespace names are going to be, whether it's user input or not,
>>>> whether there are plans to use namespaces for anything else, etc.
>>>>
>>>
>>> The namespace restriction you mentioned is for cephfs metadata. This
>>> namespace restricts users to different namespaces in the cephfs data
>>> pool. (At present, the only way to restrict data access is to create
>>> multiple cephfs data pools and restrict users to different data pools.
>>> Creating lots of pools is not efficient for the cluster.)
>>
>> What about the namespace names, who is generating them, how long are
>> they going to be?  Please describe in detail how this is going to work
>> for both data and metadata pools.
>
> For example, to restrict user foo to directory /foo_dir
>
> // create auth caps for user foo.
> ceph auth get-or-create client.foo mon 'allow r' mds 'allow r, allow rw path=/foo_dir' osd 'allow rw pool=data namespace=foo_ns'
>
> // mount cephfs by user admin
> mount -t ceph 10.0.0.1:/ /mnt/ceph_mount -o name=admin,secret=xxxx
>
> // set directory’s layout.pool_namespace
> setfattr -n ceph.dir.pool_namespace -v foo_ns /mnt/ceph_mount/foo_dir
>
> The admin user chooses the namespace name. In most cases, the namespace name does not change.

Good, I guess limiting it to 100 chars (or maybe even a smaller
number) is sensible then.
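
FWIW, what I'd expect on the kernel side is a small, capped,
kref-counted string.  A rough sketch only - the names are made up and
this is not the code from your patch:

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/string.h>

#define MAX_NS_LEN 100			/* assumed cap, per the above */

struct ns_str {
	struct kref kref;
	size_t len;
	char str[];			/* NUL-terminated namespace name */
};

static struct ns_str *ns_str_alloc(const char *name)
{
	size_t len = strlen(name);
	struct ns_str *s;

	if (len > MAX_NS_LEN)
		return NULL;

	s = kmalloc(sizeof(*s) + len + 1, GFP_NOFS);
	if (!s)
		return NULL;

	kref_init(&s->kref);
	s->len = len;
	memcpy(s->str, name, len + 1);
	return s;
}

static void ns_str_release(struct kref *kref)
{
	kfree(container_of(kref, struct ns_str, kref));
}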

>
>>
>>>
>>>
>>>> I don't like the idea of sharing namespace name strings between libceph
>>>> and ceph modules, especially with the strings themselves hosted in
>>>> libceph.  rbd has no use for namespaces, so libceph can live with
>>>> namespace names embedded into ceph_osd_request by value or with
>>>> a simple non-owning pointer, leaving reference counting to the outside
>>>> modules, if one of the use cases is "one namespace with a long name for
>>>> the entire directory tree" or something close to it.
>>>>
>>>> I think the sharing infrastructure should be moved into cephfs, and
>>>> probably changed to share entire file layouts along the way.  I don't
>>>> know this code well enough to be sure, but it seems that by sharing
>>>> file layouts and making ci->i_layout an owning pointer you might be
>>>> able to piggy back on i_ceph_lock and considerably simplify the whole
>>>> thing by dropping RCU and eliminating numerous get/put calls.
>>>
>>> RBD may use namespaces later.
>>> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_namespace_support
>>
>> Well, compared to cephfs, it's hard to call that "using" - in that
>> case, there will only ever be one namespace per image.  My point is
>> it's never going to use the string sharing infrastructure and is fine
>> with a non-owning pointer to a string in the file layout field of the
>> in-memory rbd image header.
>>
>>>
>>> The reason I use RCU here is that ci->i_layout.pool_ns can change at
>>> any time. For the same reason, using a non-owning pointer for the
>>> namespace or the entire layout is unfeasible. Using an embedded
>>> namespace is not elegant either: when allocating a ceph_osd_request,
>>> cephfs needs to lock i_ceph_lock, copy the namespace to a temporary
>>> buffer, unlock i_ceph_lock, and pass ci->i_layout and the temporary
>>> buffer to ceph_osdc_xxx_request().
>>
>> Hmm, RCU doesn't protect you from pool_ns or other file layout fields
>> changing while the OSD request is in flight.  As used above, it allows
>> ceph_try_get_string() to not take any locks and that's it.
>
> Yes.  But not taking a lock simplifies the code a lot.  We don't need
> to lock/unlock i_ceph_lock each time i_layout is used.

I wouldn't call a bunch of rcu_dereference_* variants sprinkled around
the code base a simplification, but, more importantly, is keeping the
pool_ns pointer valid really all you need?  Shouldn't there be some
kind of synchronization around "OK, I'm switching to a new layout for
this inode"?  As it is, pool_ns is grabbed in ceph_osdc_new_request(),
with two successive calls to ceph_osdc_new_request() potentially ending
up with two different namespaces, e.g. ceph_uninline_data().
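
To be concrete, the lockless lookup we're talking about boils down to
something like the below (illustrative only - field and helper names
are assumptions, not the posted patch):

#include <linux/rcupdate.h>
#include <linux/kref.h>

struct ns_str {
	struct kref kref;
	char str[];
};

/*
 * RCU only guarantees that the string we dereference stays valid; it
 * does not stop the pointer from being switched between two lookups,
 * so back-to-back callers may get different namespaces.
 */
static struct ns_str *try_get_ns(struct ns_str __rcu **p)
{
	struct ns_str *s;

	rcu_read_lock();
	s = rcu_dereference(*p);
	if (s && !kref_get_unless_zero(&s->kref))
		s = NULL;		/* string was being torn down */
	rcu_read_unlock();

	return s;
}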

>
>>
>> Why exactly can't file layouts be shared?  ci->i_layout would become
>> reference counted and you would give libceph a pointer to the pool_ns
>> string (or entire layout) after bumping it.  It doesn't matter if
>> pool_ns or the rest of the layout changes due to a cap grant or revoke
>> while libceph is servicing the OSD request: you would unlink it from
>> the tree but the bumped reference will keep the layout around, to be
>> put in the OSD request completion callback or so.  Layout lookup would
>> have to happen in exactly two places: when newing an inode and handling
>> cap grant/revoke, in other places you would simply bump the count on
>> the basis of already holding a valid pointer.  You wouldn't have to
>> unlink in the destructor, so no hassle with kref_get_unless_zero() and
>> no need for RCU, with i_ceph_lock providing the exclusion around the
>> tree.
>
> This means cephfs needs to set r_callback for every ceph_osd_request,
> and ceph_osd_request also needs a new field to store the layout
> pointer.  I don't think it's better/simpler than a reference counted
> namespace string.

Not necessarily - you can drop the reference after
ceph_osdc_wait_request() returns.
Somewhat unrelated, I'm working on refactoring osdc's handle_reply(),
and it'll probably be required that all OSD requests set one of the
callbacks, except for stateless fire-and-forget ones.
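
For the synchronous case it would look roughly like this (a sketch
with error handling trimmed; put_ns_ref() stands in for whatever put
ends up being used):

#include <linux/ceph/osd_client.h>

/* stand-in for the eventual put (kref_put on the string/layout) */
static void put_ns_ref(void *ns_ref)
{
}

static int sync_osd_request(struct ceph_osd_client *osdc,
			    struct ceph_osd_request *req,
			    void *ns_ref)	/* reference pinned by caller */
{
	int ret;

	ret = ceph_osdc_start_request(osdc, req, false);
	if (!ret)
		ret = ceph_osdc_wait_request(osdc, req);

	/* request is no longer in flight - no r_callback needed to put */
	put_ns_ref(ns_ref);
	return ret;
}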

Sharing ->i_layout as opposed to ->i_layout->pool_ns seemed sensible to
me because a) it naturally hangs off of ceph inode and b) logically,
it is entire layouts and not just namespaces that are shared across the
directory tree.  If you think reference counted pool_ns strings are
better, I won't argue with that, but, with cephfs being the only user
of either solution, it'll have to live in fs/ceph.
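
Structurally, the shared-layout variant I was picturing boils down to
an owning, refcounted pointer with i_ceph_lock as the only exclusion -
again just a sketch with made-up names:

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct shared_layout {
	struct kref kref;
	u64 pool;			/* data pool id */
	char pool_ns[101];		/* capped namespace, see above */
	/* stripe unit/count, object size, ... */
};

static void shared_layout_release(struct kref *kref)
{
	kfree(container_of(kref, struct shared_layout, kref));
}

/* cap grant/revoke path: swap in a new layout, new already holds a ref */
static void set_layout(spinlock_t *i_ceph_lock, struct shared_layout **slot,
		       struct shared_layout *new)
{
	struct shared_layout *old;

	spin_lock(i_ceph_lock);
	old = *slot;
	*slot = new;
	spin_unlock(i_ceph_lock);

	if (old)
		kref_put(&old->kref, shared_layout_release);
}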

Separately, I think passing non-owning pool_ns pointers into libceph is
worth exploring, but if that doesn't easily map onto cephfs lifetime or
ownership rules, we will go with embedding namespace names by value into
ceph_osd_request (or, rather, ceph_object_locator).
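
For completeness, the embed-by-value fallback simply carries the
namespace inside the locator, so libceph never manages its lifetime -
sizes and field names below are assumptions:

#include <linux/types.h>

#define MAX_NS_LEN 100

struct object_locator_sketch {
	s64 pool;
	u32 ns_len;
	char ns[MAX_NS_LEN + 1];	/* copied in by value by the caller */
};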

Thanks,

                Ilya


