On (07/11/14 10:13), Jakub Hrozek wrote:
>On Fri, Nov 07, 2014 at 09:59:32AM +0100, Niels de Vos wrote:
>> On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
>> > On Thu, 6 Nov 2014 22:02:29 +0100
>> > Niels de Vos <ndevos@xxxxxxxxxx> wrote:
>> >
>> > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
>> > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
>> > > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
>> > > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
>> > > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
>> > > > >>>Jakub Hrozek <jhrozek@xxxxxxxxxx> wrote:
>> > > > >>>
>> > > > >>>>Hi,
>> > > > >>>>
>> > > > >>>>we had a short discussion on $SUBJECT with Simo on IRC already,
>> > > > >>>>but there are multiple people involved from multiple timezones,
>> > > > >>>>so I think a mailing list thread would be easier to track.
>> > > > >>>>
>> > > > >>>>Can we add another memory cache file to SSSD that would track
>> > > > >>>>initgroups/getgrouplist results for the NSS responder? I realize
>> > > > >>>>initgroups is a bit different operation than getpw{uid,nam} and
>> > > > >>>>getgr{gid,nam}, but what if the new memcache was only used by
>> > > > >>>>the NSS responder and at the same time invalidated when
>> > > > >>>>initgroups is initiated by the PAM responder, to ensure the
>> > > > >>>>memcache is up-to-date?
>> > > > >>>
>> > > > >>>Can you describe the use case before jumping into a proposed
>> > > > >>>solution?
>> > > > >>
>> > > > >>Many getgrouplist() or initgroups() calls in quick succession.
>> > > > >>One user is GlusterFS -- I'm not quite sure what the reason is
>> > > > >>there, maybe Vijay can elaborate.
>> > > > >
>> > > >
>> > > > The GlusterFS server invokes getgrouplist() to identify the gids
>> > > > associated with a user on whose behalf an rpc request has been sent
>> > > > over the wire. There is a gid caching layer in GlusterFS, and
>> > > > getgrouplist() gets called only if there is a gid cache miss. In the
>> > > > worst case, getgrouplist() can be invoked for every rpc request that
>> > > > GlusterFS receives, and that seems to be the case in a deployment
>> > > > where we found that sssd was kept busy. I am not certain about the
>> > > > sequence of operations that can cause the cache to be missed.
>> > > >
>> > > > Adding Niels, who is more familiar with the gid resolution & caching
>> > > > features in GlusterFS.
>> > >
>> > > Just to add some background information on getgrouplist().
>> > > GlusterFS uses several processes that can call getgrouplist():
>> > > - NFS-server, a single process per system
>> > > - brick, a process per exported filesystem/directory, potentially
>> > >   several per system
>> > >
>> > > [Here, a Gluster environment has many systems (vm/physical). Each
>> > > system normally runs the NFS-server and a number of brick
>> > > processes. The layout of the volume is important, but it is very
>> > > common to have one or more distributed volumes that use multiple
>> > > bricks on the same system (and many other systems).]
>> > >
>> > > The need for resolving the groups of a user comes in when users belong
>> > > to many groups. The RPC protocols cannot carry a huge list of groups,
>> > > so the resolving can be done on the server side when the protocol hits
>> > > its limits (> 16 for NFS, approx. > 93 for GlusterFS).
>> > >
>> > > Upon using a Gluster volume, certain operations are sent to all the
>> > > bricks (i.e. some directory related operations). I can imagine that
>> > > a network share which is used by many users triggers many
>> > > getgrouplist() calls in different brick processes at (almost) the
>> > > same time.
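For reference, getgrouplist(3) is the glibc interface at the center of this
thread. Below is a minimal sketch of how a server-side process might resolve
a user's supplementary gids with it; the function and variable names are
illustrative only, not taken from the GlusterFS or SSSD sources:

    /* Resolve all gids of the user owning an incoming request. */
    #define _GNU_SOURCE          /* for getgrouplist() in <grp.h> on glibc */
    #include <grp.h>
    #include <pwd.h>
    #include <stdlib.h>
    #include <sys/types.h>

    static int resolve_gids(uid_t uid, gid_t **out, int *nout)
    {
        struct passwd *pw = getpwuid(uid);       /* NSS: getpwuid lookup */
        if (pw == NULL)
            return -1;

        int ngroups = 32;                        /* initial guess */
        gid_t *groups = malloc(ngroups * sizeof(gid_t));
        if (groups == NULL)
            return -1;

        /* getgrouplist() returns -1 and updates ngroups when the buffer
         * is too small; retry once with the size it asked for. */
        if (getgrouplist(pw->pw_name, pw->pw_gid, groups, &ngroups) == -1) {
            gid_t *tmp = realloc(groups, ngroups * sizeof(gid_t));
            if (tmp == NULL) {
                free(groups);
                return -1;
            }
            groups = tmp;
            if (getgrouplist(pw->pw_name, pw->pw_gid, groups, &ngroups) == -1) {
                free(groups);
                return -1;
            }
        }

        *out = groups;
        *nout = ngroups;
        return 0;
    }

With sssd configured in nsswitch.conf, every such call turns into an
initgroups request to the sssd_nss responder, which is the load being
discussed in this thread.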
>> > > For reference, the usage of getgrouplist() in the brick process can
>> > > be found here:
>> > > - https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
>> > >
>> > > The gid_resolve() function gets called in case the brick process
>> > > should resolve the groups (and ignore the list of groups from the
>> > > protocol). It uses the gidcache functions from a private library:
>> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
>> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
>> > >
>> > > The default time for the gidcache to expire is 2 seconds. Users should
>> > > be able to configure this to 30 seconds (or anything else) with:
>> > >
>> > > # gluster volume set <VOLUME> server.gid-timeout 30
>> > >
>> > > I think this should explain the use-case sufficiently, but let me know
>> > > if there are any remaining questions. It might well be possible to
>> > > make this code more sssd-friendly. I'm sure that we as Gluster
>> > > developers are open to any suggestions.
>> >
>> > TBH this looks a little bit strange; other filesystems (as well as the
>> > kernel) create a credentials token when a user first authenticates and
>> > keep these credentials attached to the user session for the duration.
>> > Why does GlusterFS keep hammering the system, requesting the same
>> > information again and again?
>>
>> The GlusterFS protocol itself is very much stateless, similar to NFSv3.
>> We need all the groups of the user on the server side (brick) to allow
>> the backing filesystem (mostly XFS) to perform the permission checking.
>> In the current GlusterFS protocol, there is no user authentication.
>> (Well, there has been work done on adding support for SSL; maybe that
>> could be used for tracking sessions on a per-client, not per-user,
>> basis.)
>>
>> Just for clarity, a GlusterFS client (like a fuse-mount, or the
>> samba/vfs_glusterfs module) is used by many different users. The client
>> builds the connection to the volume. After that, all users with access
>> to the fuse-mount or samba-share are using the same client connection.
>>
>> By default the client sends a list of groups in each RPC request, and
>> the server side trusts the list the client provides. However, for
>> environments where these lists are too small to hold all the groups,
>> there is an option to do the group resolving on the server side. This
>> is the "server.manage-gids" volume option, which acts very much like
>> the "rpc.mountd --manage-gids" functionality for NFS.
>>
>> > Keep in mind that the use of getgrouplist() is an inherently costly
>> > operation. Even adding caches, the system cannot cache for long because
>> > it needs to return updated results eventually. Only the application
>> > knows when a user session terminates and/or the list needs to be
>> > refreshed, so "caching" for this type of operation should be done
>> > mostly on the application side.
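To make that point concrete: the kind of application-side cache being
described (and which GlusterFS's gidcache provides) is essentially a small
table of uid -> gid-list entries with a deadline. A rough, purely
illustrative sketch, not the actual gidcache.c code; the names, sizes, and
timeout handling here are made up for the example:

    /* Illustrative time-limited uid -> gid-list cache. */
    #include <string.h>
    #include <time.h>
    #include <sys/types.h>

    #define SLOTS     256
    #define MAX_GIDS  128

    struct gid_entry {
        uid_t  uid;
        gid_t  gids[MAX_GIDS];
        int    ngids;
        time_t deadline;            /* 0 == slot unused */
    };

    static struct gid_entry cache[SLOTS];
    static time_t gid_timeout = 2;  /* cf. the server.gid-timeout option */

    /* Returns the number of cached gids on a hit, or -1 on a miss or an
     * expired entry (the caller then falls back to getgrouplist()). */
    static int gid_cache_lookup(uid_t uid, gid_t *gids, int max)
    {
        struct gid_entry *e = &cache[uid % SLOTS];

        if (e->deadline == 0 || e->uid != uid || time(NULL) > e->deadline)
            return -1;

        int n = e->ngids < max ? e->ngids : max;
        memcpy(gids, e->gids, n * sizeof(gid_t));
        return n;
    }

    static void gid_cache_store(uid_t uid, const gid_t *gids, int ngids)
    {
        struct gid_entry *e = &cache[uid % SLOTS];

        e->uid   = uid;
        e->ngids = ngids < MAX_GIDS ? ngids : MAX_GIDS;
        memcpy(e->gids, gids, e->ngids * sizeof(gid_t));
        e->deadline = time(NULL) + gid_timeout;
    }

The tunable that matters for the load seen by sssd is the timeout: with the
2-second default, a busy brick effectively re-runs getgrouplist() for every
active user every 2 seconds.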
>> I assume that your "application side" here is the brick process that
>> runs on the same system as sssd. As mentioned above, the brick processes
>> do cache the result of getgrouplist(). It may well be possible that the
>> default expiry of 2 seconds is too short for many environments. But
>> users can change that timeout easily with the "server.gid-timeout"
>> volume option.
>
>I guess that might be a viable option to work around the problem for the
>user who initially reported it, but it also doesn't align with what I
>saw in the logs... the sssd_nss logs showed 4000 initgroup requests over
>two minutes from maybe about 10 users.
>
>>
>> From my understanding of this thread, we (the Gluster Community) have
>> two things to do:
>>
>> 1. Clearly document side-effects that can be caused by enabling the
>>    "server.manage-gids" option, and suggest increasing the
>>    "server.gid-timeout" value (maybe change the default?).
>>
>> 2. Think about improving the GlusterFS protocol(s) and introduce some
>>    kind of credentials token that is linked with the groups of a user.
>>    Token expiry should invalidate the group-cache. One option would be
>>    to use Kerberos like NFS does (RPCSEC_GSS).
>>
>> Does this all make sense to others too? I'm adding gluster-devel@ to CC
>> so that others can chime in and this topic won't be forgotten.
>>
>> Thanks,
>> Niels
>
>And on the SSSD side, we need to think about an initgroups cache. So far
>I filed ticket https://fedorahosted.org/sssd/ticket/2485 listing the two
>options Simo outlined earlier.
>
>GlusterFS is not the only project that requested faster initgroups
>caching; Alexander's slapi-nis would also benefit from the new cache.
>(Although with slapi-nis we also have a somewhat conflicting RFE to stop
>using the NSS interfaces and go to SSSD directly, but that's something
>for us to solve.)

The memory cache is used in the nss responder as well :-)

LS

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel