On Fri, Nov 07, 2014 at 08:42:50AM -0500, Simo Sorce wrote:
> On Fri, 7 Nov 2014 09:59:32 +0100
> Niels de Vos <ndevos@xxxxxxxxxx> wrote:
>
> > On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> > > On Thu, 6 Nov 2014 22:02:29 +0100
> > > Niels de Vos <ndevos@xxxxxxxxxx> wrote:
> > >
> > > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> > > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> > > > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> > > > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> > > > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
> > > > > >>>Jakub Hrozek <jhrozek@xxxxxxxxxx> wrote:
> > > > > >>>
> > > > > >>>>Hi,
> > > > > >>>>
> > > > > >>>>we had a short discussion on $SUBJECT with Simo on IRC
> > > > > >>>>already, but there are multiple people involved from
> > > > > >>>>multiple timezones, so I think a mailing list thread would
> > > > > >>>>be better trackable.
> > > > > >>>>
> > > > > >>>>Can we add another memory cache file to SSSD, that would
> > > > > >>>>track initgroups/getgrouplist results for the NSS
> > > > > >>>>responder? I realize initgroups is a bit of a different
> > > > > >>>>operation than getpw{uid,nam} and getgr{gid,nam}, but what
> > > > > >>>>if the new memcache was only used by the NSS responder and
> > > > > >>>>at the same time invalidated when initgroups is initiated
> > > > > >>>>by the PAM responder to ensure the memcache is up-to-date?
> > > > > >>>
> > > > > >>>Can you describe the use case before jumping into a proposed
> > > > > >>>solution?
> > > > > >>
> > > > > >>Many getgrouplist() or initgroups() calls in quick
> > > > > >>succession. One user is GlusterFS -- I'm not quite sure what
> > > > > >>the reason is there, maybe Vijay can elaborate.
> > > > >
> > > > > The GlusterFS server invokes getgrouplist() to identify the gids
> > > > > associated with a user on whose behalf an rpc request has been
> > > > > sent over the wire. There is a gid caching layer in GlusterFS
> > > > > and getgrouplist() does get called only if there is a gid cache
> > > > > miss. In the worst case, getgrouplist() can be invoked for
> > > > > every rpc request that GlusterFS receives, and that seems to be
> > > > > the case in a deployment where we found that sssd was being
> > > > > busy. I am not certain about the sequence of operations that
> > > > > can cause the cache to be missed.
> > > > >
> > > > > Adding Niels who is more familiar with the gid resolution &
> > > > > caching features in GlusterFS.
> > > >
> > > > Just to add some background information on the getgrouplist() usage.
> > > > GlusterFS uses several processes that can call getgrouplist():
> > > > - NFS-server, a single process per system
> > > > - brick, a process per exported filesystem/directory, potentially
> > > >   several per system
> > > >
> > > > [Here, a Gluster environment has many systems (vm/physical).
> > > > Each system normally runs the NFS-server, and a number of brick
> > > > processes. The layout of the volume is important, but it is very
> > > > common to have one or more distributed volumes that use multiple
> > > > bricks on the same system (and many other systems).]
> > > >
> > > > The need for resolving the groups of a user comes in when users
> > > > belong to many groups. The RPC protocols cannot carry a huge
> > > > list of groups, so the resolving can be done on the server side
> > > > when the protocol hits its limits (> 16 for NFS, approx. > 93 for
> > > > GlusterFS).
> > > >
> > > > Upon using a Gluster volume, certain operations are sent to all
> > > > the bricks (i.e. some directory related operations). I can
> > > > imagine that a network share which is used by many users triggers
> > > > many getgrouplist() calls in different brick processes at
> > > > (almost) the same time.
> > > >
> > > > For reference, the usage of getgrouplist() in the brick process
> > > > can be found here:
> > > > - https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
> > > >
> > > > The gid_resolve() function gets called in case the brick process
> > > > should resolve the groups (and ignore the list of groups from the
> > > > protocol). It uses the gidcache functions from a private library:
> > > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> > > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> > > >
> > > > The default time for the gidcache to expire is 2 seconds. Users
> > > > should be able to configure this to 30 seconds (or anything else)
> > > > with:
> > > >
> > > >   # gluster volume set <VOLUME> server.gid-timeout 30
> > > >
> > > > I think this should explain the use-case sufficiently, but let me
> > > > know if there are any remaining questions. It might well be
> > > > possible to make this code more sssd friendly. I'm sure that we
> > > > as Gluster developers are open to any suggestions.
> > >
> > > TBH this looks a little bit strange; other filesystems (as well as
> > > the kernel) create a credentials token when a user first
> > > authenticates and keep these credentials attached to the user
> > > session for the duration. Why does GlusterFS keep hammering the
> > > system requesting the same information again and again?
> >
> > The GlusterFS protocol itself is very much stateless, similar to
> > NFSv3. We need all the groups of the user on the server-side (brick)
> > to allow the backing filesystem (mostly XFS) to perform the permission
> > checking. In the current GlusterFS protocol, there is no user
> > authentication. (Well, there has been work done on adding support for
> > SSL, maybe that could be used for tracking sessions on a per-client,
> > not per-user, basis.)
> >
> > Just for clarity, a GlusterFS client (like a fuse-mount, or the
> > samba/vfs_glusterfs module) is used by many different users. The
> > client builds the connection to the volume. After that, all users
> > with access to the fuse-mount or samba-share are using the same
> > client connection.
> >
> > By default the client sends a list of groups in each RPC request, and
> > the server-side trusts the list the client provides. However, for
> > environments where these lists are too small to hold all the groups,
> > there is an option to do the group resolving on the server side. This
> > is the "server.manage-gids" volume option, which acts very much like
> > the "rpc.mountd --manage-gids" functionality for NFS.
>
> Instead of sending a list of groups every time ... wouldn't it be
> better to send a "session token" (a random 128bit uuid) and let the
> bricks use this value to associate their cached lists?
>
> This way you can control how caching is done from the client side.

Yes, I was hoping RPCSEC_GSS could help with that. But that is a major
change and it'll take a while for it to get stable and used in
deployments. Looking at it, there is an AUTH_SHORT option that we
probably can use.
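For the sssd folks, the brick-side work on a gid cache miss boils down
to a getpwuid() plus a getgrouplist() call, roughly like the sketch
below. This is simplified plain C, not the actual server-helpers.c
code, and resolve_groups() is a made-up name:

/*
 * Simplified sketch of what a brick ends up doing on a gid cache miss
 * when "server.manage-gids" is enabled -- not the actual
 * server-helpers.c code, just the shape of the NSS traffic it causes.
 */
#include <sys/types.h>
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>

/* Resolve all gids of 'uid'; the caller frees *groups.
 * Returns the number of groups, or -1 on error. */
static int resolve_groups(uid_t uid, gid_t **groups)
{
    struct passwd *pw = getpwuid(uid);          /* NSS lookup #1 */
    if (pw == NULL)
        return -1;

    int ngroups = 16;                           /* initial guess */
    gid_t *list = malloc(ngroups * sizeof(gid_t));
    if (list == NULL)
        return -1;

    /* NSS lookup #2: the initgroups-style request that reaches sssd. */
    if (getgrouplist(pw->pw_name, pw->pw_gid, list, &ngroups) == -1) {
        /* list was too small; ngroups now holds the real count */
        gid_t *bigger = realloc(list, ngroups * sizeof(gid_t));
        if (bigger == NULL) {
            free(list);
            return -1;
        }
        list = bigger;
        if (getgrouplist(pw->pw_name, pw->pw_gid, list, &ngroups) == -1) {
            free(list);
            return -1;
        }
    }

    *groups = list;
    return ngroups;
}

Every one of those lookups lands in sssd's NSS responder, which is why
a brick doing this for (nearly) every rpc request keeps sssd busy, as
described above.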
We do not use AUTH_SYS, but some variation called AUTH_GLUSTERFS. In the
end, they function pretty much the same. More on AUTH_SHORT:
- http://tools.ietf.org/html/rfc5531#page-25

One of the difficulties would be to have all the bricks be aware of the
token. There is no inter-brick communication...

> > > Keep in mind that the use of getgrouplist() is an inherently costly
> > > operation. Even adding caches, the system cannot cache for long
> > > because it needs to return updated results eventually. Only the
> > > application knows when a user session terminates and/or the list
> > > needs to be refreshed, so "caching" for this type of operation
> > > should be done mostly on the application side.
> >
> > I assume that your "application side" here is the brick process that
> > runs on the same system as sssd. As mentioned above, the brick
> > processes do cache the result of getgrouplist(). It may well be
> > possible that the default expiry of 2 seconds is too short for many
> > environments. But users can change that timeout easily with the
> > "server.gid-timeout" volume option.
>
> Well, the problem is that, unless you know you have some sort of user
> session, longer caches only have the effect of upsetting users whose
> credentials have just changed.
>
> The way it *should* work, at least if you want posix compatibility[1],
> is that once a user, on a client, starts a session, his credentials
> never change until the user logs out and logs back in. Regardless of
> what happens in the identity management system (or the passwd/group
> files).

Well, we would like to improve the Gluster behaviour, and making it
"posix compliant" at the same time works for me. I imagine that this
would be possible when we use RPCSEC_GSS, just like NFS does.

> > From my understanding of this thread, we (the Gluster Community) have
> > two things to do:
> >
> > 1. Clearly document side-effects that can be caused by enabling the
> >    "server.manage-gids" option, and suggest increasing the
> >    "server.gid-timeout" value (maybe change the default?).
> >
> > 2. Think about improving the GlusterFS protocol(s) and introduce some
> >    kind of credentials token that is linked with the groups of a user.
> >    Token expiry should invalidate the group-cache. One option would be
> >    to use Kerberos like NFS (RPCSEC_GSS).
>
> Using RPCSEC_GSS is one good way to tie a user to its credentials, as
> said credentials are tied to the GSS context and never changed until
> the context is destroyed. Using, in general, a token created on
> "session establishment"[2] and used while valid would resolve a host of
> issues and make your filesystem more posix compliant and predictable
> when it comes to access control decisions.

The biggest advantage for the Gluster use-case seems to be that the
token is valid on all the systems hosting a brick for a particular
volume. At least, I hope that is the case. Because of the nature of the
scale-out, scale-up filesystem, systems and bricks can get added
whenever a sysadmin deems it necessary. I do not immediately see a
solution that prevents the issue in your [*] footnote; that would
require Gluster to pass credentials (and tokens?) around to all the
bricks when they come online. It is not impossible, but it requires
quite a bit more work.

> > Does this all make sense to others too? I'm adding gluster-devel@ to
> > CC so that others can chime in and this topic won't be forgotten.
>
> It does.
>
> Simo.
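To make the trade-off concrete: the cache that "server.gid-timeout"
controls is conceptually just an expiring per-uid group list on the
brick, along the lines of the toy sketch below. All names here are made
up; the real code is libglusterfs/src/gidcache.c. A longer timeout
means fewer getgrouplist() calls into sssd, at the price of serving a
stale group list for up to that many seconds, which is exactly the
objection above.

/*
 * Toy sketch of a brick-side gid cache with a configurable expiry,
 * similar in spirit to what "server.gid-timeout" tunes.
 */
#include <sys/types.h>
#include <string.h>
#include <time.h>

#define CACHE_BUCKETS    256
#define MAX_CACHED_GIDS  128

struct gid_entry {
    uid_t  uid;
    time_t deadline;                   /* entry is stale after this */
    int    ngroups;
    gid_t  groups[MAX_CACHED_GIDS];
};

static struct gid_entry gid_cache[CACHE_BUCKETS];
static int gid_timeout = 2;            /* seconds, cf. server.gid-timeout */

/* Return the cached group count for uid, or -1 on a miss/expired entry,
 * in which case the caller falls back to getgrouplist(). */
static int gid_cache_lookup(uid_t uid, gid_t *out, int max)
{
    struct gid_entry *e = &gid_cache[uid % CACHE_BUCKETS];

    if (e->uid != uid || e->deadline < time(NULL) || e->ngroups > max)
        return -1;

    memcpy(out, e->groups, e->ngroups * sizeof(gid_t));
    return e->ngroups;
}

/* Store a freshly resolved list; overwrites whatever hashed to the slot. */
static void gid_cache_store(uid_t uid, const gid_t *groups, int ngroups)
{
    struct gid_entry *e = &gid_cache[uid % CACHE_BUCKETS];

    if (ngroups > MAX_CACHED_GIDS)     /* a real cache must not truncate */
        return;
    e->uid = uid;
    e->ngroups = ngroups;
    memcpy(e->groups, groups, ngroups * sizeof(gid_t));
    e->deadline = time(NULL) + gid_timeout;
}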
>
> [1] IIRC Posix requires that the credentials set in the kernel at login
> time are used throughout the lifetime of the process unchanged. This is
> particularly important as a process may *intentionally* drop auxiliary
> groups or even change its credentials set entirely (like root switching
> to a different uid and an arbitrary set of gids).
> You may decide this is not something you want to care about for network
> access, and in fact NFS + RPCSEC_GSS does *not* do this, as it always
> computes the credentials set on the server side at (GSSAPI) context
> establishment time. Up to you to decide what semantics you want to
> follow, but they should be at least predictable if at all possible.

If the NFS + RPCSEC_GSS semantics are well understood, they should work
for Gluster too. The main requirements would be that a userspace process
can get the token for a user, and pass that on through a library call
that then does the GlusterFS RPC stuff. Samba with vfs_glusterfs would
be one of these users, glusterfs-fuse another.

> [2] You will have to define what this means for GlusterFS, I can see
> only a few constraints to make it useful.
> - the session needs to be initiated by a specific client
> - you need a way to either pass the information that a new session is
>   being established or pass the credential set to the bricks
> - you need to cache this session on the bricks side and you cannot
>   discard it at will (yes, this means state needs to be kept)*
> - if a client connects randomly to multiple bricks, it means this cache
>   needs to be distributed and accessible to any brick anywhere that
>   needs the information
> - if state cannot be kept then you have no other option but to always
>   re-transmit the whole credential token, as big as it may be (the
>   maximum size on a linux system would be 256K at the moment: 1 32bit
>   uid + 65k 32bit gids).

Maybe we can ask for the whole credential token when a client connects
to the brick for the first time, and after that use the session token.
This would solve the issue I mentioned above about adding systems and
bricks.

> * the reason you do not want to let each brick resolve the groups is
> that you may end up with different bricks having a different list of
> groups a uid is a member of. This would lead to nasty, very hard to
> debug access issues that admins would hate you for :)

Yes, that is a very good point.

Thanks again,
Niels
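PS: to make the session idea a bit more concrete, a rough sketch of
what a brick-side credential cache could look like is below: the client
sends its full credential set once when the session is established, the
brick files it under the client-chosen random 128-bit token, and later
RPCs only carry the token. Every name in the sketch is made up, and a
real implementation would also need locking, expiry, and a way to share
the entries with the other bricks.

#include <sys/types.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct cred_session {
    unsigned char token[16];    /* random 128-bit session id from the client */
    uid_t         uid;
    uint32_t      ngroups;      /* up to 65k gids: ~256K worst case, which is */
    gid_t        *groups;       /* why re-sending it with every RPC hurts     */
    struct cred_session *next;
};

static struct cred_session *sessions;   /* single-threaded toy list */

/* Called once, when a client establishes a session and sends its credentials. */
static int session_store(const unsigned char token[16], uid_t uid,
                         const gid_t *groups, uint32_t ngroups)
{
    struct cred_session *s = calloc(1, sizeof(*s));
    if (s == NULL)
        return -1;

    s->groups = malloc(ngroups * sizeof(gid_t));
    if (s->groups == NULL) {
        free(s);
        return -1;
    }
    memcpy(s->token, token, sizeof(s->token));
    memcpy(s->groups, groups, ngroups * sizeof(gid_t));
    s->uid = uid;
    s->ngroups = ngroups;
    s->next = sessions;
    sessions = s;
    return 0;
}

/* Called for every later RPC that only carries the token. */
static struct cred_session *session_lookup(const unsigned char token[16])
{
    for (struct cred_session *s = sessions; s != NULL; s = s->next)
        if (memcmp(s->token, token, sizeof(s->token)) == 0)
            return s;
    return NULL;    /* unknown token: ask the client for full credentials */
}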